CN116894756A - Multi-core master/slave communication - Google Patents
Multi-core master/slave communication
- Publication number
- CN116894756A (application CN202310323495.0A)
- Authority
- CN
- China
- Prior art keywords
- core
- unit
- master unit
- task
- cores
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3877—Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3889—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/542—Event management; Broadcasting; Multicasting; Notifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5017—Task decomposition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/509—Offload
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/60—Memory management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2210/00—Indexing scheme for image generation or computer graphics
- G06T2210/52—Parallel processing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Graphics (AREA)
- Multimedia (AREA)
- Image Processing (AREA)
- Image Generation (AREA)
Abstract
The present application relates to multi-core master/slave communications. A multi-core Graphics Processing Unit (GPU) and a method of operating a GPU are provided. The GPU includes at least a first core and a second core. At least one core in the multi-core GPU includes a master unit configured to receive a set of image processing tasks, assign a first subset of the tasks to the first core and a second subset of the tasks to the second core, and transmit the first subset to the first core and the second subset to the second core.
Description
Cross Reference to Related Applications
The present application claims priority from UK patent applications GB 2204508.2 and GB 2204510.8, each filed on 30 March 2022, which are incorporated herein by reference in their entirety.
Technical Field
The present application relates to graphics processing. In particular, the present application relates to graphics processing using multi-core graphics processing units.
Background
In computer graphics, "rendering" is the process of converting a 3D model describing a virtual scene into one or more 2D images representing views of the scene from a particular viewpoint (or viewpoints). Since this is a computationally intensive process for typical virtual scenes, a hardware accelerator dedicated to performing the necessary calculations is usually provided. Such hardware accelerators are known in the art as Graphics Processing Units (GPUs).
Different GPUs may have different hardware architectures, reflecting different strategies for performing the computations necessary for 3D rendering. One exemplary GPU uses a "tile-based deferred rendering" pipeline.
This approach separates the rendering process into two distinct phases. First, geometry data describing a 3D model of the scene is processed to convert it from 3D space to the 2D coordinates of the image to be rendered, based on a particular viewpoint. This will be referred to as the geometry processing stage (or simply "geometry processing"). The output of this stage is the transformed geometry, which is stored in a "parameter buffer" in a so-called "parameter block".
The transformed geometry in the parameter buffer will later be used to determine "fragments". The second stage is therefore referred to as the fragment shading or fragment processing stage. It may also be referred to as the "3D" stage, or simply "fragment processing".
In the second stage, the transformed geometry data is read from the parameter buffer and rasterized, meaning transformed into fragments and mapped to pixels. As part of this process, a depth test is performed to determine the fragments that are actually visible at each pixel (or at each sample position, if there is no one-to-one correspondence between sample positions and pixels). In a deferred rendering system, the GPU retrieves texture data (including color information) for the relevant fragments only once the system has determined which fragments are visible. A shader program is run for each visible fragment, and the shaded fragments are used to determine the pixel values to be displayed.
In the past, rendering work was performed in parallel on multiple cores by connecting the cores in a multi-core system (via separate dedicated connections) to a central hub. The central hub assigned work to each core and included a shared cache accessible to all cores. For example, as processing capacity became available on each core, the central hub allocated rendering tasks to the cores of the multi-core system, coordinating them to process rendering tasks in parallel.
Due to the increased speed and bandwidth of modern graphics processing units, central hub systems are no longer a practical means of enabling parallel processing. One problem faced by central hub systems is chip area: the dedicated connections between the central hub and the cores do not directly contribute to the processing of rendering tasks, yet they occupy chip space that could otherwise accommodate another core.
A related problem is scalability. While additional cores may be added to a multi-core system to improve its performance, each added core also increases the number of dedicated connections required and the complexity of the chip layout.
There is a need to develop a multi-core GPU that more efficiently utilizes chip space and can achieve a higher degree of parallelization.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
A multi-core Graphics Processing Unit (GPU) and a method of operating a GPU are provided. The GPU includes at least a first core and a second core. At least one core in the multi-core GPU includes a master unit configured to receive a set of image processing tasks, assign a first subset of the tasks to the first core and a second subset of the tasks to the second core, and transmit the first subset to the first core and the second subset to the second core.
According to one aspect, there is provided a graphics processing unit, hereinafter GPU, comprising a plurality of cores, wherein each core of the plurality of cores comprises a slave unit configured to manage execution of image rendering tasks within the core, and wherein at least one core of the plurality of cores further comprises a master unit configured to:
receiving a set of image rendering tasks;
assigning a first subset of image rendering tasks to a first core of the plurality of cores;
assigning a second subset of image rendering tasks to a second core of the plurality of cores;
transmitting the first subset of image rendering tasks to a slave unit of the first core; and
transmitting the second subset of image rendering tasks to a slave unit of the second core.
The master unit is responsible for assigning and distributing work to at least the first core and the second core. The master unit may receive the set of image rendering tasks from the application driver.
The first core or the second core may comprise the master unit. Alternatively, a third core may comprise the master unit. The master unit may assign a subset of the image rendering tasks to the core in which it resides. When the master unit is in the third core, it may assign a third subset of image rendering tasks to the third core.
The first subset of image rendering tasks consists of different tasks than the second subset of image rendering tasks. In other words, tasks assigned to one core are not assigned to another core as well.
Each of the plurality of cores may be identical. This means that each core may contain the same components, in particular the same master unit and slave unit, so that each slave unit and each master unit in a core has a corresponding unit in each of the other cores. When the cores are identical, the master units of all cores except one may be inactive.
Each core may include more than one master unit and at least the same number of slave units. In each core, each active master unit is responsible for assigning work to one slave unit; no two active master units will assign work to the same slave unit. For example, a first master unit may assign subsets of a first set of tasks to the first slave unit of each core, and a second master unit may assign subsets of a second set of tasks to the second slave unit of each core, the second set of tasks comprising a different type of task than the first set. In an example, the first set of tasks may be fragment processing tasks and the second set may be geometry processing tasks. In examples where each core includes two master units, both master units of one core may be active while the master units of the other cores are inactive. Alternatively, one master unit in one core may be active together with one master unit of another core, while all remaining master units are inactive. Alternatively, the cores may each include a single master unit, and each core may include at least as many slave units as there are active master units in the graphics processing system. For example, if the first core and the second core each include an active master unit, each core may include two slave units (a first slave unit and a second slave unit in each core). The master unit of the first core may assign work to the first slave unit in each core, and the master unit of the second core may assign work to the second slave unit in each core.
The slave unit of the first core may be configured to transmit a first credit notification to the master unit when a task of the first subset of image rendering tasks has been processed. The slave unit of the second core may be further configured to transmit a second credit notification to the master unit when a task of the second subset of image rendering tasks has been processed. The master unit may be configured to: store a credit number for each of the first core and the second core; when the master unit assigns the first subset of image rendering tasks to the first core, adjust the credit number of the first core by a first amount for each task of the first subset; when the master unit assigns the second subset of image rendering tasks to the second core, adjust the credit number of the second core by the first amount for each task of the second subset; adjust the credit number of the first core by a second amount when the master unit receives the first credit notification; and adjust the credit number of the second core by the second amount when the master unit receives the second credit notification, wherein one of the first amount and the second amount is positive and the other is negative.
By changing the credit number of each core by a set amount for each task assigned to that core, the master unit can keep track of the number of tasks it has assigned to each core and, by extension, which core has been assigned the most tasks. By changing the credit number of a core in the opposite direction for each task that the core reports as completed, the master unit can keep track of how busy each core currently is.
The first amount and the second amount may be equal in magnitude. For example, where the first amount is a positive integer, the second amount may be the negative of that integer. Depending on whether the first amount is positive or negative, a high credit number indicates that the core is busy and a low credit number indicates that it is not (or vice versa). For example, when the first amount is positive, a core with a larger positive credit number is busier than a core with a smaller positive credit number.
When the GPU includes more than two cores, the master unit may maintain credit numbers for these additional cores in the same manner.
The master unit may be configured to: assign a subsequent image rendering task, based on the credit number of each core, to the slave unit of the core currently assigned the least work; adjust the credit number of the core assigned the subsequent image rendering task by the first amount; and transmit the subsequent image rendering task to the slave unit of the core to which it has been assigned.
If the first amount is positive, the core with the largest negative credit number is the core currently assigned the least work. If the first amount is negative, the core with the largest positive credit number is the core currently assigned the least work.
By assigning new tasks to the core with the fewest pending tasks, as indicated by the cores' credit numbers, the master unit can avoid piling work onto one core while another core finishes its work and risks going idle. Instead, the master unit can assign further work to a core that has completed work, preventing it from idling and maintaining a better balance of work between the cores. This load balancing helps to sustain parallel processing of image rendering tasks for longer, thereby improving the performance of the graphics processing unit. The credit number of each core may be initialized with the same value.
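The credit-tracking scheme described above can be sketched as follows. This is a minimal illustration under the assumption that the first amount is +1 per assigned task and the second amount is -1 per completed task; the class and method names are invented for the example and are not from the patent:

```python
class MasterUnit:
    def __init__(self, core_ids):
        # All cores start with the same initial credit number.
        self.credits = {core: 0 for core in core_ids}

    def assign(self, task):
        # Pick the core with the least outstanding work (lowest credit
        # number, since assignment adds +1 here).
        target = min(self.credits, key=self.credits.get)
        self.credits[target] += 1   # first amount: +1 per assigned task
        return target               # caller transmits the task to target's slave unit

    def on_credit_notification(self, core):
        # Slave unit reports one task complete: second amount, -1.
        self.credits[core] -= 1

m = MasterUnit(["core0", "core1"])
print(m.assign("taskA"))       # core0 (tie broken by insertion order)
print(m.assign("taskB"))       # core1
m.on_credit_notification("core0")
print(m.assign("taskC"))       # core0 again: fewest outstanding tasks
```

Note that the master unit needs no direct visibility into the cores' queues; the credit numbers alone approximate each core's outstanding workload.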
When each core has more than one slave unit, and the graphics processing unit includes one active master unit for each slave unit in a core, each active master unit may store, for each core, a credit number for one of that core's slave units. Each master unit may store credit numbers for different slave units.
The first core may include a first number of available processing units (referred to herein as PUs) configured to perform rendering operations, and the second core may include a second number of available PUs configured to perform rendering operations. The master unit may assign image rendering tasks to the first and second cores in numbers directly related to the first and second numbers of available PUs.
The master unit may assign to each core a number of image rendering tasks proportional to the number of available PUs in that core. The master unit may weight the credit numbers of the first core and the second core based on the first number and the second number of available PUs, for example in proportion to those numbers. The master unit may weight the credit numbers such that, when each core has the same number of tasks to be processed as it has available PUs, the credit numbers of the first core and the second core are the same. For example, a core with eight available PUs, each assigned a task, may have a weighted credit number of +8. A second core with four available PUs, each assigned a task, may also have a weighted credit number of +8. More generally, the credit number of each core may be weighted in proportion to the number of available PUs in the core to reflect how busy the core is.
When the master unit weights the credit numbers of each core based on the number of available PUs in that core, the master unit may assign subsequent image rendering tasks to the core with the least work as indicated by the credit numbers of the cores.
When the master unit does not weight the credit numbers of each core to account for the first and second numbers of available PUs, and the credit numbers of each core are the same, the master unit may assign subsequent image rendering tasks to cores having a greater number of available PUs.
The slave units of the first and second cores may inform the master unit of the first and second numbers of available PUs before the master unit assigns any rendering tasks to the cores. The number of available PUs may be configured by the application and may be less than the number of PUs in the core. The number of available PUs may vary during the image rendering process. The core may update the master unit as the number of its available PUs changes, and the master unit may adjust the core's credit number accordingly.
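One way to realise the PU-based weighting, consistent with the +8/+8 example above, is to scale each core's outstanding-task count by the inverse of its available-PU count, so that a fully occupied core scores the same regardless of how many PUs it has. This is only an illustrative interpretation; the function names and the scale constant are assumptions:

```python
def weighted_credit(outstanding_tasks, available_pus, scale=8):
    # Each task counts for more on a core with fewer PUs: a core whose
    # every PU has a task always scores `scale`, matching the +8/+8 example.
    return outstanding_tasks * scale / available_pus

def pick_core(cores):
    # cores: {name: (outstanding_tasks, available_pus)}; least busy wins.
    return min(cores, key=lambda c: weighted_credit(*cores[c]))

print(weighted_credit(8, 8))                   # 8.0 -- 8-PU core, fully occupied
print(weighted_credit(4, 4))                   # 8.0 -- 4-PU core, equally busy
print(pick_core({"A": (8, 8), "B": (2, 4)}))   # B: only half its PUs occupied
```

If a core later reports a changed PU count, recomputing with the new `available_pus` value is all that is needed to keep the comparison fair.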
The first subset of image rendering tasks may include a first task, wherein the first task is a task on which a dependent task depends. The master unit may be configured to include a task completion update command in the first subset of image rendering tasks, after the first task. The slave unit of the first core may be configured to send a first task completion update to the master unit when the first core processes the task completion update command. The master unit may be configured to assign and transmit a dependent task of the first task to one of the slave units of the first and second cores only after the master unit has received the first task completion update. A dependent task of the first task is a task that depends on the result of the first task.
Likewise, if the second subset of image rendering tasks includes the first task (the task on which the dependent task depends), the master unit may include a task completion update command in the second subset and after the first task. The slave unit of the second core may be configured to send a second task completion update when it processes the task completion update command.
The dependent tasks of the first task are any image rendering tasks that can be correctly processed only when the earlier first task has been completed. This may occur, for example, when a dependent task requires the output of the first task as input. Any one or more of the tasks in the first and/or second subset of image rendering tasks may be the first task, and the term "first task" does not refer to the location of the task in the first or second subset of image rendering tasks.
The task completion update informs the master unit that all tasks preceding the task completion update command in the subset have been performed. By including a task completion update command after the first task, the task completion update informs the master unit that the first task has been completed, meaning that the dependent task can now be processed. The task completion update command may immediately follow the first task, such that the core processes the task completion update command immediately after it processes the first task and before it processes any other tasks.
One example of a task completion update command is a work fence command. On processing the work fence command, the slave unit within the core may transmit a fence update to the active master unit that assigned the work to the core.
The task completion update may be distinct from the credit notifications transmitted by the slave unit. For example, the slave unit of a core may be configured to send a credit notification each time the core completes a task, whereas it sends a task completion update only when the core processes a task completion update command.
The master unit may be configured to include a memory refresh command in the first subset of image rendering tasks, after the first task and optionally before the task completion update command. The slave unit of the first core may be configured to write all processed work stored in the first core to the shared memory when the first core processes the memory refresh command.
Likewise, if the second subset of image rendering tasks includes the first task (the task on which a dependent task depends), the master unit may include a memory refresh command in the second subset of image rendering tasks, after the first task and before the task completion update command. The slave unit of the second core may be configured to write all processed work stored in the second core to the shared memory when it processes the memory refresh command.
By placing the memory refresh command after the first task, the data generated by processing the first task (the output of the first task) is made available to all cores by being written (flushed) to the shared memory. This enables any core to handle a dependent task of the first task, since all cores have access to the output data of the first task. The first core and the second core may write to the same shared memory.
By placing the refresh command before the task completion update command, the task completion update informs the master unit not only that the first task has been completed, but also that the refresh has been completed.
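The ordering of the first task, the memory refresh (flush) command, and the task completion update (fence) command, together with the master unit's gating of dependent work, can be sketched as follows. The names and the command encoding are illustrative assumptions, not from the patent:

```python
# Build a subset command stream: after the task that others depend on,
# append a memory-refresh (flush) command, then a fence-style task
# completion update command, in that order.
def build_subset(tasks, dependency_task):
    stream = []
    for t in tasks:
        stream.append(("TASK", t))
        if t == dependency_task:
            stream.append(("FLUSH", None))  # write core-local results to shared memory
            stream.append(("FENCE", t))     # slave unit reports back on reaching this
    return stream

stream = build_subset(["t0", "t1", "t2"], dependency_task="t1")
# Flush precedes fence, so the fence update also confirms the flush is done.

# Master-unit side: hold back dependent tasks until the fence update arrives.
completed_fences = set()

def on_fence_update(task_id):
    completed_fences.add(task_id)

def can_dispatch_dependent(depends_on):
    return depends_on in completed_fences

assert not can_dispatch_dependent("t1")  # fence update not yet received
on_fence_update("t1")                    # slave unit processed the fence command
assert can_dispatch_dependent("t1")      # safe: t1 done and its output flushed
```

Because the flush is sequenced before the fence in the same stream, a single fence update is enough to guarantee both completion and visibility of the first task's output.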
Each of the plurality of cores may include a second slave unit configured to manage execution of image rendering tasks of a second type within the core. One of the cores may include a second master unit configured to: receive a second set of image rendering tasks of the second type; assign a first subset of the second set of image rendering tasks to a first core of the plurality of cores; assign a second subset of the second set of image rendering tasks to a second, different core of the plurality of cores; transmit the first subset of the second set of image rendering tasks to the second slave unit of the first core; and transmit the second subset of the second set of image rendering tasks to the second slave unit of the second core.
The second set of image rendering tasks consists of image rendering tasks of a different type than the first set. For example, the first set of image rendering tasks may be compute tasks and the second set may be geometry tasks.
When the core includes both the first master unit and the second master unit, the first master unit and the second master unit may be implemented as two physically separate units in the core.
In some examples, the different cores may include a first master unit and a second master unit. For example, the first core may include a first master unit and the second core may include a second master unit. The second master unit may assign and transmit a first subset of the second set of image rendering tasks to the second slave unit of the first core and a second subset of the second set of image rendering tasks to the second slave unit of the second core.
In some examples, each of the plurality of cores may include a first master unit and a second master unit. However, in this case, only one of the first master units and only one of the second master units may be active.
Just as the active first master unit may maintain a credit number for each core to which it has assigned an image rendering task, the active second master unit may maintain a credit number for each core to which it has assigned a task. In particular, the first master unit may maintain a credit number for the first slave unit of each core to which it has assigned work, and the second master unit may maintain a credit number for the second slave unit of each core to which it has assigned work. When the first master unit assigns an image rendering task to a core, it may adjust the credit number of that core by a first amount, as described above. The first master unit adjusts its credit number for a core only in response to its own tasks being assigned to that core and in response to that core notifying it that one of those tasks has been completed; the second master unit likewise adjusts its credit number for a core only in response to its own tasks being assigned and completed. In this way, two independent credit numbers may be maintained for each core.
The master unit may be configured to output a first register write command and a second register write command. The first register write command may be addressed to the first core and may include an indication of the first subset of image rendering tasks. The second register write command may be addressed to the second core and may include an indication of the second subset of image rendering tasks. The plurality of cores may be connected by a register bus configured to transfer register write commands between the cores.
The multi-core system may include a register bus that connects each core, thereby enabling transfer of register information between the cores. By using this register bus to transfer image rendering tasks between cores, the need for dedicated connections between cores may be eliminated, thereby saving space on the chip.
The master unit may address each register write command to the core to which it has assigned the corresponding subset of image rendering tasks. When a master unit assigns a subset of tasks to the core on which it resides, it may address a register write command containing an indication of those tasks to that core.
The master unit may transmit the register write commands directly to the various cores, or may output them to another unit in the core containing the master unit for transmission. When each core includes a plurality of slave units, a register write command may be addressed to a particular slave unit in a particular core. The register write command may contain an address in memory from which the slave unit can obtain the data necessary to process the image rendering tasks.
When the slave units of the cores are configured to transmit credit notifications and/or task completion updates, these credit notifications and/or task completion updates may be in the form of register write commands addressed to the master unit (or to the core comprising the master unit) or in the form of register read commands addressed to the master unit (or to the core comprising the master unit).
The core including the master unit may also include an arbitration unit in communication with the master unit and the slave unit of the core. The arbitration unit may be configured to: receive register write commands from the master unit; and, for each register write command: if the register write command is addressed to the core including the master unit, pass the register write command to the slave unit of that core; and if the register write command is not addressed to the core including the master unit, forward the register write command for transmission over the register bus.
In other words, the arbitration unit may be configured to route tasks assigned (by the master unit) to the slave unit of the core including the master unit to that slave unit, without transmitting those tasks over the register bus. Subset tasks assigned to any core other than the core comprising the master unit are not routed to the slave unit of the core comprising the master unit. Instead, they are forwarded by the arbitration unit for transmission over the register bus. This may mean that they are forwarded to another hardware unit in the core comprising the master unit for transmission to the relevant core via the register bus, or that they are sent directly to the register bus and transmitted to the relevant core.
In examples where each core includes multiple slave units, the arbitration unit of the core including the master unit may communicate with each slave unit of that core, and may route tasks assigned to any of those slave units to the relevant slave unit. The master unit may address a task to a particular slave unit by using a particular register address associated with that slave unit.
Each core may include an arbitration unit as described above that communicates with all of the master and slave units of the core. When a core receives a register write command over a register bus that is addressed to a slave unit of the core, an arbitration unit in the core may route the register write command to the slave unit to which the register write command is addressed. In this way, the slave unit receives the subset work assigned to it.
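By way of illustration only, the routing behaviour of such an arbitration unit may be sketched as follows. The class and field names (`ArbitrationUnit`, `dest_core`, and so on) are illustrative assumptions rather than part of the described hardware, and Python lists stand in for the local slave unit and the register bus.

```python
class ArbitrationUnit:
    """Illustrative model of the arbitration unit in the core hosting the master unit.

    Commands addressed to this core are delivered to the local slave unit;
    all other commands are forwarded for transmission over the register bus.
    """

    def __init__(self, core_id, local_slave_queue, register_bus_queue):
        self.core_id = core_id
        self.local_slave_queue = local_slave_queue
        self.register_bus_queue = register_bus_queue

    def handle(self, command):
        # 'command' is modelled as a dict: {"dest_core": int, "payload": task}
        if command["dest_core"] == self.core_id:
            # Local delivery: no register bus traffic is generated.
            self.local_slave_queue.append(command["payload"])
        else:
            # Forward for transmission over the register bus.
            self.register_bus_queue.append(command)
```

In this sketch, appending to the bus queue stands for either of the two options described above: forwarding to another hardware unit for onward transmission, or sending directly to the register bus.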
When the slave units of the cores are configured to transmit one or more of a credit notification, a CFI notification and a task completion update, the respective arbitration units of the first and second cores may be configured to forward the CFI notifications, task completion updates and/or credit notifications to the register bus, over which they may each be transmitted to the master unit, or to the relevant active master unit when there are multiple active master units. The core including the master unit may be configured to receive credit notifications, task completion updates or CFI notifications from the slave units of the other cores. The arbitration unit of the core including the master unit may be configured to pass these credit notifications, task completion updates or CFI notifications to the master unit. A credit notification, task completion update or CFI notification may be in the form of a register read command or a register write command addressed to the core comprising the master unit. The register read/write command may contain information that enables the master unit to identify which core sent the command; in one example, this may be achieved by using a particular register address associated with the core that includes the master unit. The arbitration unit of the first/second core may forward a communication to be sent by the first/second core to the master unit to the register bus (if the master unit is in another core). The arbitration unit of the first/second core may determine whether a credit notification, task completion update or CFI notification is addressed to its own core, in which case it may pass the notification to its own master unit. As explained above, forwarding for transmission may mean forwarding to another hardware unit in the core, or directly to the register bus, for transmission to the relevant core over the register bus.
The plurality of cores may each include an interface unit in communication with a register bus. The interface unit of the core including the master unit may be configured to: receiving a first register write command and a second register write command; and transmitting the first register write command to the first core and the second register write command to the second core over the register bus.
The interface unit of the first core may be configured to: receiving a first register write command via a register bus; and forwarding the first register write command to the slave of the first core.
The interface unit of the second core may be configured to: receiving a second register write command via a register bus; and forwarding the second register write command to the slave unit of the second core.
Each interface unit may be a System On Chip Interface (SOCIF). The interface unit of the core comprising the master unit may receive one of a credit notification, a CFI notification and a task completion update, in the form of a register read/write command, from a slave unit of the same core (or from another core via the register bus), and may pass it to the master unit (either directly or via the arbitration unit).
Forwarding the register write command to the slave unit may mean sending it directly to the slave unit to which it is addressed, or through another unit or units within the core (e.g. an arbitration unit).
The interface unit of the first core may be configured to: determining whether the first register write command is addressed to a first reserved register address; and forwarding the first register write command to the slave unit of the first core if the first register write command is addressed to the first reserved register address. The interface unit of the second core may be configured to: determining whether the second register write command is addressed to a second reserved register address; and forwarding the second register write command to the slave unit of the second core if the second register write command is addressed to the second reserved register address.
The reserved register address is a register address that the interface unit of the core has been configured to use for master-slave communication only. When the interface unit receives a register read/write command addressed to a reserved register address, instead of simply reading data from and/or writing data to the register, it will transfer the data to the master or slave of the core, as the case may be, depending on the address. If the register read/write command does not use the reserved register address, the interface unit treats it as a regular register read/write command (meaning that it is not forwarded to the slave unit of the core). In this way, the interface unit can distinguish between conventional register read/write commands and master-slave communications.
Each core may have more than one reserved register address associated with it. For example, the first core may be associated with a first reserved register address for a slave unit in the first core and a second reserved register address for a master unit in the first core. In general, each slave unit in each core may be associated with a unique reserved register address. Likewise, each master unit in each core may be associated with a unique reserved register address.
Communications sent from a slave unit, such as credit notifications and task completion updates, may also be addressed to reserved register addresses. The interface unit of the core comprising the master unit may send such communications to the master unit only if they are addressed to the reserved register address associated with the master unit.
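The reserved-address dispatch performed by an interface unit may be sketched as follows. The specific addresses (`0x8000`, `0x8004`) and the function name are hypothetical; a real SOCIF would use implementation-defined reserved addresses.

```python
# Hypothetical reserved register addresses for this core; real values are
# implementation-specific. Any other address is treated as a regular register.
RESERVED = {0x8000: "slave", 0x8004: "master"}


def interface_dispatch(reg_addr, data, registers, slave_queue, master_queue):
    """Sketch of how an interface unit might distinguish ordinary register
    writes from master/slave communications, based on reserved addresses."""
    target = RESERVED.get(reg_addr)
    if target == "slave":
        slave_queue.append(data)    # master-to-slave task delivery
    elif target == "master":
        master_queue.append(data)   # e.g. credit notification or completion update
    else:
        registers[reg_addr] = data  # regular register write: just store the data
```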
Forwarding the register write command to a slave unit may mean forwarding it directly to the slave unit or forwarding it indirectly to the slave unit via another hardware unit, e.g. an arbitration unit.
The multiple cores may each include the same number of master units, and may each include the same number of slave units.
The cores of the graphics processing system may be physically identical, meaning that they include identical components; in particular, the master units in each core may be identical, and the slave units in each core may be identical. Because each core has a slave unit and a master unit, each core is capable of operating independently in a single-core system or configuration.
The first core or the second core may comprise a master unit.
According to another aspect, there is provided a method of transmitting an image rendering task in a graphics processing unit including a plurality of cores, the method comprising:
receiving, by a master unit of a core of the plurality of cores, a set of image rendering tasks;
assigning, by the master unit, a first subset of image rendering tasks to a first core of the plurality of cores;
assigning, by the master unit, a second subset of image rendering tasks to a second core of the plurality of cores;
transmitting, by the master unit, a first subset of image rendering tasks to a slave unit of the first core; and
transmitting, by the master unit, a second subset of image rendering tasks to a slave unit of the second core.
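The method steps above may be sketched as follows, with Python lists standing in for the slave units of the cores and an illustrative round-robin split standing in for whatever assignment policy the master unit applies (the names `distribute` and `slave_queues` are assumptions for illustration):

```python
def distribute(tasks, slave_queues):
    """Sketch of the method: split a received set of image rendering tasks
    into per-core subsets and transmit each subset to that core's slave unit."""
    cores = list(slave_queues)
    subsets = {core: [] for core in cores}
    for i, task in enumerate(tasks):
        # Illustrative round-robin assignment of tasks to cores.
        subsets[cores[i % len(cores)]].append(task)
    for core, subset in subsets.items():
        # "Transmit" each subset to the slave unit of its assigned core.
        slave_queues[core].extend(subset)
    return subsets
```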
The method may further comprise: storing, by the master unit, a credit number for each of the first core and the second core; adjusting, by the master unit, the credit number of the first core by a first amount for each task in the first subset of image rendering tasks; adjusting, by the master unit, the credit number of the second core by the first amount for each task in the second subset of image rendering tasks; transmitting, by the slave unit of the first core, a first credit notification to the master unit when a task of the first subset of image rendering tasks has been processed; transmitting, by the slave unit of the second core, a second credit notification to the master unit when a task of the second subset of image rendering tasks has been processed; adjusting, by the master unit, the credit number of the first core by a second amount when the master unit receives the first credit notification; and adjusting, by the master unit, the credit number of the second core by the second amount when the master unit receives the second credit notification, wherein one of the first amount and the second amount is positive and the other is negative.
The method may further comprise: assigning, by the master unit, based on the credit number of each core, a subsequent image rendering task to the slave unit of the core currently assigned the least work; adjusting, by the master unit, the credit number of the core assigned the subsequent image rendering task by the first amount; and transmitting, by the master unit, the subsequent image rendering task to the slave unit of the core to which it has been assigned.
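The least-busy assignment and credit adjustment steps above may be sketched as follows, taking the first amount as +1 and the second amount as -1 (one positive, one negative, as the method requires); the function names are illustrative:

```python
def assign_next_task(credits, task, transmit):
    """Assign a subsequent task to the core with the fewest outstanding
    credits (i.e. the least busy core), adjust its credit by the first
    amount (+1 here), then transmit the task to that core's slave unit."""
    core = min(credits, key=credits.get)
    credits[core] += 1          # first amount: +1 per assigned task
    transmit(core, task)
    return core


def on_credit_notification(credits, core):
    """Second amount: -1 when a core's slave unit reports a completed task."""
    credits[core] -= 1
```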
The method may further comprise: assigning, by the master unit, image rendering tasks to the first core and the second core in direct proportion to a first number of available processing units, referred to herein as PUs, and a second number of available PUs, respectively, where the first number of available PUs is the number of available PUs in the first core and the second number of available PUs is the number of available PUs in the second core.
The method may also include weighting, by the master unit, the credit number of the first core based on the first number of available PUs, and weighting the credit number of the second core based on the second number of available PUs.
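One way such weighting could work is to normalise each core's credit number by its number of available PUs, so that a core with more PUs appears less busy for the same number of outstanding tasks. This particular normalisation is an illustrative assumption, not a requirement of the method:

```python
def least_busy_weighted(credits, available_pus):
    """Pick the core whose outstanding credits, weighted by its number of
    available processing units (PUs), indicate the lightest relative load."""
    return min(credits, key=lambda core: credits[core] / available_pus[core])
```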
The method may further comprise: including, by the master unit, a task completion update command after a first task in the first subset of image rendering tasks; processing, by the first core, the first task; processing, by the first core, the task completion update command; transmitting, by the slave unit of the first core, a task completion update to the master unit; assigning, by the master unit, a dependent task of the first task to the slave unit of one of the first core and the second core; and transmitting, by the master unit, the dependent task to the core to which it has been assigned.
In the same manner, the method may include inserting, by the master unit, a task completion update command in the second subset of image rendering tasks after a first task, and transmitting, by the slave unit of the second core, a task completion update when the slave unit of the second core processes the task completion update command.
The method may further comprise: including, by the master unit, a memory flush command in the first subset of image rendering tasks after the first task, and optionally before the task completion update command; processing, by the first core, the memory flush command; and writing, by the slave unit of the first core, all output data stored in the first core to the shared memory.
In the same manner, the method may include inserting, by the master unit, a memory flush command in the second subset of image rendering tasks after the first task (and optionally before the task completion update command), and writing, by the slave unit of the second core, all output data stored in the second core to the shared memory. The first core and the second core may write to the same shared memory, or to different shared memories.
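The resulting command-stream ordering (a task, then a flush of its outputs to shared memory, then the completion update that dependent tasks wait on) may be sketched as follows; the command tags such as "FLUSH" and the function name are illustrative:

```python
def build_subset_stream(tasks, first_task_index):
    """Build a subset command stream in which the task whose dependants must
    wait for it is followed first by a memory flush command (so its outputs
    reach shared memory) and then by a task completion update command."""
    stream = []
    for i, task in enumerate(tasks):
        stream.append(("TASK", task))
        if i == first_task_index:
            stream.append(("FLUSH", None))      # write outputs to shared memory
            stream.append(("COMPLETE", task))   # then notify the master unit
    return stream
```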
The method may further comprise: receiving, by a second master unit in any one of the plurality of cores, a second set of image rendering tasks of a second type; assigning, by the second master unit, a first subset of the second set of image rendering tasks to the first core; assigning, by the second master unit, a second subset of the second set of image rendering tasks to the second core; transmitting, by the second master unit, the first subset of the second set of image rendering tasks to the second slave unit of the first core; and transmitting, by the second master unit, a second subset of the second set of image rendering tasks to a second slave unit of the second core.
The transmission of the first subset and the second subset may include outputting, by the master unit, a first register write command and a second register write command. The first register write command may be addressed to the first core and may include an indication of the first subset of image rendering tasks. The second register write command may be addressed to the second core and may include an indication of the second subset of image rendering tasks. The plurality of cores may be connected by a register bus for communicating register write commands between the cores.
The transmitting may further comprise: receiving, by the arbitration unit of the core including the master unit, a plurality of register write commands from the master unit; and, for each register write command: if the register write command is addressed to the core including the master unit, transmitting, by the arbitration unit, the register write command to the slave unit of the core including the master unit; and, if the register write command is not addressed to the core including the master unit, forwarding, by the arbitration unit, the register write command to the register bus.
If the register write command is not addressed to a core including the master unit, the arbitration unit may forward the register write command to another hardware unit in the core including the master unit for onward transmission to the relevant other core via the register bus. Alternatively, the arbitration unit may forward the command directly to the register bus for transmission to the associated core.
The transmitting may further comprise: receiving, by an interface unit including a core of a master unit, a first register write command and a second register write command; transmitting, by an interface unit including a core of the master unit, a first register write command to the first core and a second register write command to the second core through a register bus; receiving, by an interface unit of a first core, a first register write command; forwarding, by the interface unit of the first core, the first register write command to the slave unit of the first core; receiving, by an interface unit of a second core, a second register write command; and forwarding, by the interface unit of the second core, the second register write command to the slave unit of the second core.
Forwarding the register write command to the slave unit may mean sending it directly to the slave unit to which it is addressed, or through another unit or units within the core (e.g. an arbitration unit).
The method may further comprise: determining, by an interface unit of the first core, whether the first register write command is addressed to a first reserved register address; and forwarding the first register write command to the slave unit of the first core if the first register write command is addressed to the first reserved register address; determining, by an interface unit of the second core, whether the second register write command is addressed to the second reserved register address; and forwarding the second register write command to the slave unit of the second core if the second register write command is addressed to the second reserved register address.
Each core, and optionally each master unit and each slave unit within each core, may be associated with a different reserved register address.
There is also provided a graphics processing system comprising a GPU as outlined above and/or configured to perform the method as outlined above. The graphics processing system may be included in hardware on an integrated circuit.
A method of manufacturing a graphics processing system as outlined above using an integrated circuit manufacturing system is also provided.
There is also provided a method of manufacturing a graphics processing system as outlined above using an integrated circuit manufacturing system, the method comprising: processing the computer readable description of the graphics processing system using the layout processing system to generate a circuit layout description of the integrated circuit containing the graphics processing system; and manufacturing the graphics processing system from the circuit layout description using the integrated circuit generation system.
There is also provided computer readable code configured such that when the code is executed, the method as outlined above is performed. A computer readable storage medium (optionally non-transitory) having encoded thereon the computer readable code is also provided.
An integrated circuit definition data set is also provided that, when processed in an integrated circuit manufacturing system, configures the integrated circuit manufacturing system to manufacture a graphics processing system as outlined above.
There is also provided a computer readable storage medium having stored thereon a computer readable description of a graphics processing system as outlined above, which when processed in an integrated circuit manufacturing system causes the integrated circuit manufacturing system to manufacture an integrated circuit comprising the graphics processing system.
There is also provided a computer readable storage medium (optionally non-transitory) having stored thereon a computer readable description of a graphics processing system as outlined above, which when processed in an integrated circuit manufacturing system causes the integrated circuit manufacturing system to: processing the computer readable description of the graphics processing system using the layout processing system to generate a circuit layout description of the integrated circuit containing the graphics processing system; and manufacturing the graphics processing system from the circuit layout description using the integrated circuit generation system.
There is also provided an integrated circuit manufacturing system configured to manufacture a graphics processing system as outlined above.
There is also provided an integrated circuit manufacturing system comprising:
a computer readable storage medium (optionally non-transitory) having stored thereon a computer readable description of a graphics processing system as outlined above;
a layout processing system configured to process the computer readable description to generate a circuit layout description of an integrated circuit containing the graphics processing system; and
an integrated circuit generation system configured to fabricate the graphics processing system in accordance with the circuit layout description.
The layout processing system may be configured to determine location information of logic components of a circuit derived from the integrated circuit description to generate a circuit layout description of an integrated circuit containing the graphics processing system.
As will be apparent to those skilled in the art, the above features may be combined as appropriate, and combined with any of the aspects of the examples described herein.
Drawings
Examples will now be described in detail with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of a GPU according to an example;
FIG. 2 is a flow chart illustrating a method according to an example;
FIG. 3 illustrates a block diagram of a GPU according to an example;
FIG. 4 is a flow chart illustrating a method according to an example;
FIG. 5 is a flow chart illustrating a method according to an example;
FIG. 6 is a flow chart illustrating a method according to an example;
FIG. 7 is a block diagram of a GPU according to an example;
FIG. 8 is a flow chart illustrating a method according to an example;
FIG. 9 is a block diagram of a GPU according to an example;
FIG. 10 is a flow chart illustrating a method according to an example;
FIG. 11 is a block diagram of a GPU according to an example;
FIG. 12 is a flow chart illustrating a method according to an example;
FIG. 13 is a block diagram of a GPU according to an example;
FIG. 14A is a flow chart illustrating a method according to an example;
FIG. 14B is a flow chart illustrating a method according to an example;
FIG. 15 is a block diagram of a GPU according to an example;
FIG. 16 is a flow chart illustrating a method according to an example;
FIG. 17 is a flow chart illustrating a method according to an example;
FIG. 18 illustrates a computer system in which a graphics processing system is implemented; and
FIG. 19 illustrates an integrated circuit manufacturing system for generating an integrated circuit containing a graphics processing system.
The figures illustrate various examples. Skilled artisans will appreciate that element boundaries (e.g., blocks, groups of blocks, or other shapes) illustrated in the figures represent one example of boundaries. In some examples, it may be the case that one element may be designed as a plurality of elements, or that a plurality of elements may be designed as one element. Where appropriate, common reference numerals have been used throughout the various figures to indicate like features.
Detailed Description
The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.
Embodiments will now be described by way of example only.
An alternative to the parallel processing system described above, relying on a central hub, is a parallel processing system that uses a fixed mapping of work and cores. In a fixed mapping system, there is no central hub. Instead of the central hub distributing rendering tasks, a fixed mapping is used to distribute tasks among the cores-thereby assigning rendering tasks to the cores in a predetermined manner. A simple example of this in a dual core system is splitting the scene in half around a vertical axis. One core may be assigned the image rendering task of the left half of the scene and another core may be assigned the task of the right half of the scene.
While this fixed mapping system solves some of the problems associated with a central hub, it is affected by skew between cores, which reduces the degree of parallelization of the GPUs. Skew refers to the difference in processing time between cores. Skew occurs when one core completes its assigned task before another core and becomes idle. The greater the skew, the more time that some cores of the GPU are idle and the less time that the cores of the GPU are processing tasks in parallel. To achieve maximum parallelization, skew should be minimized.
Skew is the result of the fact that different image rendering tasks have different computational requirements and take different time to process. It is not possible to determine in advance how long each task takes to process, which means that while a fixed map can be easily configured to ensure that the same number of tasks are provided to all cores, it is not possible to allocate tasks according to the fixed map so that each core completes its work at the same time. This means that, although the GPU initially processes tasks in parallel, as the cores progress through their workload, certain cores inevitably complete before others and become idle. As more cores become idle, the degree of parallelization decreases and the task processing rate of the GPU decreases.
Another cause of skew is contention within a core. Contention occurs within a core when the core has been assigned multiple tasks competing for the core's resources. For example, consider a case in which a first core is assigned both geometric processing tasks and fragment processing tasks, while a second core is assigned only geometric processing tasks. The second core is able to process its geometric processing tasks immediately; however, if the fragment processing tasks assigned to the first core have been indicated as being high priority, the first core will process those fragment processing tasks before processing its geometric processing tasks. This contention between the geometric processing tasks and the fragment processing tasks in the first core delays the completion of the geometric processing tasks, which may result in further delays further down the image processing pipeline.
It is desirable to address the issues of chip space and skew in order to achieve a high performance multi-core GPU.
Examples according to the present disclosure provide a GPU. The GPU includes a plurality of cores. One of the cores includes a master unit responsible for distributing tasks among the cores.
An exemplary GPU is depicted in fig. 1.
The GPU 100 includes a first core 110, a second core 120, and a third core 130. Each of the first, second and third cores includes a slave unit 111, 121, 131. In addition, in this example, the third core includes a master unit 140.
The first core and the second core (and optionally the third core) may each include one or more Processing Units (PUs) 199. Each processing unit 199 in a core may communicate with a slave unit of the core. The processing unit 199 may be responsible for processing image rendering tasks. In some examples, the slave units may each include one or more processing units. The one or more processing units may include dedicated hardware configured to perform certain types of image rendering tasks. For example, the processing unit may comprise dedicated hardware configured to handle geometric processing tasks. However, the one or more processing units need not be dedicated to performing a particular type of image rendering task. Rather, they may be capable of performing a variety of different types of image rendering tasks. In some examples, one or more processing units may be shared in that they are assigned work by different slave units within the core that process different work types.
The master unit 140 is configured to receive a set of image rendering tasks. The master unit 140 is configured to assign a first subset of the image rendering tasks to the first core 110 and a second subset of the image rendering tasks to the second core 120. The master unit 140 is configured to transmit the first subset to the slave unit 111 of the first core 110 and the second subset to the slave unit 121 of the second core 120. The master unit 140 may also assign and transmit a third subset of image rendering tasks to the slave unit 131 of the third core 130.
The slave units 111, 121, 131 of the cores are configured to receive the image rendering tasks assigned to and transmitted to them by the master unit 140. The slave units 111, 121, 131 may assign the received image rendering tasks to processing units 199 within their cores for processing.
Fig. 2 is a flowchart depicting an exemplary method 200 performed by GPU 100. At step 210, the master unit 140 of the third core 130 receives a set of image rendering tasks. At step 220, the master unit 140 assigns the first subset to the first core 110, and at step 240, the master unit 140 transmits the first subset to the slave unit 111 of the first core. Similarly, at step 230, master unit 140 assigns a second subset to second core 120, and at step 250, master unit 140 transmits the second subset to slave unit 121 of second core 120. Although step 240 must always occur after step 220, and step 250 must always occur after step 230, there is no particular ordering constraint between the left branch (steps 220 and 240) and the right branch (steps 230 and 250) of the flowchart of FIG. 2.
Master unit 140 is responsible for assigning and distributing tasks among the cores of the GPU. In other words, the master unit 140 enables the core to which it belongs (in this example, the third core 130) to take on the work-distribution function of a central hub. However, unlike a central hub, which is not itself capable of performing image rendering tasks, the third core 130 is a fully functional core and is capable of performing the same types of image rendering tasks as the first core 110 and the second core 120. As described above, in some examples of GPU 100 of fig. 1, master unit 140 may assign and transmit a third subset to the third core 130.
In the above example, the third core 130 of the GPU includes a master unit, and the first core 110 and the second core 120 process tasks. However, this is not always the case. For example, fig. 3 depicts a GPU 300 in which a first core 310 includes a master unit 340 in addition to a slave unit 311. The master unit 340 is configured to assign a first subset to the first core 310 (assign the first subset to its own core) and a second subset to the second core 320. The master unit 340 is configured to transmit a first subset to the slave units 311 of the first core 310 and a second subset to the slave units 321 of the second core 320.
In these aspects, one core of the GPU 100/300 may not only assume the work distribution function of the central hub, but may also actively handle image rendering tasks, thereby improving GPU performance and more efficiently utilizing chip space.
The set of image rendering tasks may be received by the master unit 140/340 from a client driver/application driver, or from another hardware unit in the GPU 100/300.
As explained above, to account for unpredictable variations in the complexity of image rendering tasks (how much "work" each task involves) and to reduce skew, tasks are dynamically assigned to the cores. Dynamic assignment of tasks means that, when skew begins to develop between cores, it can be corrected by providing additional tasks to the cores that are processing their tasks more quickly. However, to load balance the cores, master unit 140 requires additional tasks that it can later assign to the least busy core. Thus, when tasks are first assigned to the cores of the GPU, master unit 140 may withhold some tasks, reserving them for load balancing. In other words, only a portion of the total tasks may be assigned to the cores at a given time. In the example of fig. 1, master unit 140 may assign fewer than 50% of the tasks in the set of tasks to first core 110 and fewer than 50% of the tasks to second core 120. In some examples, master unit 140 may assign substantially fewer than 50% of the tasks to any one core at a given time, e.g. 1% or less of the tasks in the set of tasks to any one core at any given time.
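The initial partial distribution described above may be sketched as follows, with the fraction assigned per core (here 1%, matching the example figure above) treated as a tunable, illustrative parameter:

```python
def initial_distribution(tasks, cores, fraction=0.01):
    """Assign only a small fraction of the task set to each core up front,
    reserving the remaining tasks for later load balancing."""
    per_core = max(1, int(len(tasks) * fraction))
    assigned = {core: [] for core in cores}
    remaining = iter(tasks)
    for core in cores:
        for _ in range(per_core):
            assigned[core].append(next(remaining))
    reserve = list(remaining)  # held back by the master unit for load balancing
    return assigned, reserve
```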
Unassigned tasks may be assigned to a core to compensate for skew as the cores complete their tasks. This process achieves load balancing. Load balancing aims to increase the amount of time during which the cores are concurrently processing image rendering tasks, and thereby to reduce overall processing time. In other words, load balancing attempts to prevent one core from completing its tasks and becoming idle while another core is still processing tasks. It may also be viewed as ensuring that each core is provided with an amount of work in proportion to its processing power. When the cores have equal processing power, this means providing the cores with equal workloads. This is the example that will be discussed in more detail below.
As described above, it is not possible to determine in advance how much work each task involves, and thus it is difficult to provide the cores with equal workloads. However, by assigning additional tasks to the least busy core, the master unit can still drive the workloads currently assigned to the cores towards equality. By continuing this process throughout the image rendering process, the master unit keeps the cores' workloads close to equal, which means the cores process roughly the same amount of work overall and are therefore active for roughly the same time, increasing parallelization.
In some examples, a credit-based system may be used for load balancing. An exemplary implementation of a credit system will be explained in more detail in the context of GPU 100 of fig. 1, in the example where the cores have the same processing capabilities. The principles described below apply regardless of the number of cores in the GPU, and regardless of which core contains the active master unit.
The master unit 140 may be configured to store a credit number for each core 110, 120, 130 to which it is configured to assign tasks. The credit numbers are typically all initialized to the same value (e.g., zero); the magnitude of the initial value is not important and can be any size. The master unit 140 may be configured to adjust the credit number of a core by a first amount (e.g., by incrementing it by one) when assigning a task to that core. Each of the slave units 111, 121, 131 may be configured to send a credit notification to the master unit 140 when its core completes an image rendering task. The credit notification may include information identifying the slave unit sending it, so that master unit 140 knows which core has completed a task. Upon receiving a credit notification, master unit 140 may be configured to adjust the credit number of the core that sent it by a second amount (e.g., by decrementing it by one). By adjusting a core's credit number by a set amount in one direction for each task assigned to the core, and in the opposite direction for each credit notification sent by the core's slave unit, master unit 140 maintains a running count of the number of outstanding tasks assigned to each core. The credit number represents how busy a core is (with work assigned by a particular master unit), and the difference between the credit numbers of two or more cores indicates how busy one core is relative to another (with that type of work). By storing and maintaining credit numbers for the cores, the master unit 140 tracks how busy each core is throughout the image rendering process, thereby enabling it to load balance the cores. This will be explained in more detail with reference to fig. 4 and with respect to the exemplary GPU 100 of fig. 1 (although the corresponding method is applicable to other GPUs, such as GPU 300 of fig. 3).
As with the method 200 of fig. 2, in step 210, the master unit 140 of the third core 130 receives a set of image rendering tasks. Before the master unit 140 assigns a task to a core, it first stores 400 the credit number of each available core. The available cores are cores that are currently configured to handle image rendering tasks. In this example, the first core 110 and the second core 120 are available cores. The master unit 140 then assigns 220 a first subset of the set of image rendering tasks to the first core 110, and adjusts 410 the credit number of the first core 110 by a first amount for each task assigned to the first core 110 (for each task in the first subset). At step 240, the master unit 140 transmits the first subset to the slave units 111 of the first core 110. Similarly, the master unit 140 assigns 230 a second subset of the set of image rendering tasks to the second core 120 and adjusts 420 the credit number of the second core 120 by the first amount for each task in the second subset. The master unit 140 then transmits 250 the second subset to the slave unit 121 of the second core.
After the first core 110 processes a task in the first subset, the slave unit 111 of the first core transmits 430 a first credit notification to the master unit 140. The master unit 140 adjusts 450 the credit number of the first core 110 by the second amount upon receiving the first credit notification. Similarly, after the second core 120 processes a task in the second subset, the slave unit 121 transmits 440 a second credit notification to the master unit 140, and the master unit 140 adjusts 460 the credit number of the second core 120 by the second amount upon receiving the second credit notification. The slave unit 111 of the first core transmits 430 a first credit notification each time the first core 110 completes a task, and the slave unit 121 of the second core 120 transmits 440 a second credit notification each time the second core 120 completes a task. When the third core 130 is also assigned tasks, the slave unit 131 of the third core 130 may transmit a third credit notification each time the third core 130 processes a task, and the master unit 140 may adjust the credit number of the third core 130 by the second amount upon receiving the third credit notification.
As depicted in fig. 5, after master unit 140 has adjusted the credit number of a core in response to a credit notification, master unit 140 may assign 500 one or more subsequent image rendering tasks to the least busy core (as indicated by the credit numbers of the available cores). In step 510, master unit 140 adjusts the credit number of the core to which a subsequent task has been assigned by the first amount and, in step 520, transmits the task to that core. By assigning additional tasks to the least busy core, that core is prevented from becoming idle. This helps to reduce skew, thereby maintaining parallel processing of the image rendering tasks. By adjusting the credit numbers of the cores to which additional tasks have been assigned, master unit 140 ensures that the credit numbers remain up to date and truly reflect how busy each core is.
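The credit bookkeeping described above can be sketched as a simplified software model (illustrative only; the class and method names are assumptions, not the patent's hardware implementation):

```python
class MasterUnit:
    """Simplified model of the credit-based load balancing described
    above: assigning a task adjusts a core's credit number by a first
    amount (+1); a credit notification adjusts it by a second amount (-1)."""

    def __init__(self, core_ids):
        # Credit numbers are all initialized to the same value.
        self.credits = {core_id: 0 for core_id in core_ids}

    def assign_task(self, core_id, task):
        self.credits[core_id] += 1   # first amount
        self.transmit(core_id, task)

    def on_credit_notification(self, core_id):
        # Sent by a slave unit each time its core completes a task.
        self.credits[core_id] -= 1   # second amount

    def least_busy_core(self):
        # The difference between credit numbers, not their magnitude,
        # indicates relative busyness.
        return min(self.credits, key=self.credits.get)

    def transmit(self, core_id, task):
        pass  # placeholder for, e.g., a register write over a bus


m = MasterUnit(["core0", "core1"])
m.assign_task("core0", "task_a")
m.assign_task("core0", "task_b")
m.assign_task("core1", "task_c")
m.on_credit_notification("core0")
# Each core now has one outstanding task.
assert m.credits == {"core0": 1, "core1": 1}
```

Subsequent tasks would then be routed to `least_busy_core()`, mirroring step 500.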
The method may be cycled as depicted in fig. 5. By cycling through the method, load balancing of the cores is continuously maintained, thereby sustaining a high degree of parallelization for as long as possible. It should be noted that each slave unit 111, 121, 131 is capable of buffering only a limited number of image rendering tasks. This is the maximum number of tasks that can be assigned to a core by master unit 140 at any time. In practice, rendering an image typically involves a number of image rendering tasks that is orders of magnitude larger than the buffer size of each core. By continuing to load balance the cores as they process the image rendering tasks, the cores may process tasks in parallel for at least a majority of the image rendering process.
While FIG. 5 presents the assignment of a subsequent task as occurring after the master unit receives one or more notifications that tasks in a subset have been completed, it should be understood that this is not a prerequisite for assigning a subsequent task. For example, if a master unit knows (e.g., based on the credit numbers of the individual slave units) that a slave unit has capacity for more work than is currently allocated to it, it can immediately allocate newly received work to that slave unit, regardless of whether the slave unit has completed previously allocated work. In other words, the allocation of work depends on the work available for allocation and the capacity of the slave units.
When assigning 220, 230 the first and second subsets to the first and second cores 110, 120, the master unit 140 may select the sizes of the first and second subsets (the number of tasks in each subset) such that they completely fill the buffers of the slave units 111, 121. In this way, the credit numbers of the first core 110 and the second core 120 after the first and second subsets have been assigned represent maximally busy cores. Any credit number indicating that a core is less busy than this maximally busy state means that the core has capacity to accept additional tasks. If a core's credit number indicates the maximally busy state, that core is at maximum capacity and the master unit does not assign it any additional work.
The first and second amounts may have any magnitude, but opposite signs, such that changing the credit number by one of the amounts increases it and changing it by the other decreases it. The magnitude of the amounts is not important, because it is the difference between the credit numbers, and not their magnitude, that indicates to master unit 140 which core is busier. In some examples, the first amount is positive and the second amount is negative. In such examples, the greater a core's (positive) credit number, the more tasks it has been assigned to process, and the busier it is. Conversely, if the first amount is negative and the second amount is positive, the more negative a core's credit number, the busier the core. In some examples, the first and second amounts have the same magnitude.
When determining which core is least busy in order to load balance the cores, master unit 140 may compare the credit numbers and identify the core with the least positive credit number (or the least negative credit number, depending on the sign of the first and second amounts) as the least busy core.
One useful factor to consider in alleviating skew is the capacity of each core to handle tasks. Within a core, the slave unit may be responsible for distributing the tasks assigned to that core among the PUs 199 within the core, which process the image rendering tasks. In some examples, each core of the GPU may be identical. That is, all cores may have the same number of master units (considered in further detail below), slave units, and PUs, with each master unit and slave unit in a core having a corresponding unit in each of the other cores. However, this need not be the case. In some examples, the cores may be different (e.g., the cores may have different numbers of master units, slave units, and/or PUs). When two cores have different numbers of PUs 199, they have different capabilities for processing tasks. For example, a core with two PUs 199 may handle twice as many tasks at a time as a core with only a single PU 199, all other things being equal.
Even when cores are physically identical, they may still be configured to have different processing capabilities. For example, a core may be partitioned: half of the core's PUs 199 may be reserved for geometry processing tasks while the other half is reserved for compute processing tasks. Such a core has half the effective processing power for geometry processing compared to a non-partitioned core with the same total number of PUs 199. The number of PUs 199 in a core that are available to perform a particular type of image rendering task is referred to as the number of available PUs 199 in that core.
Each core may send information about its number of available PUs 199 to the master unit 140. When assigning 220, 230, 500 tasks to cores, the master unit may take into account any difference between the number of available PUs 199 in each core, assigning tasks to cores in direct relation to the number of available PUs 199 in each core. In other words, master unit 140 may consider both the credit number of each core (indicating how busy the core is) and the number of available PUs 199 the core has (its overall capacity to complete tasks). For example, where the first core 110 has two available PUs 199 and the second core 120 has four available PUs 199, the master unit 140 may initially assign twice as many tasks to the second core 120 as to the first core 110 (filling both cores to maximum capacity), and treat the credit number of the first core 110 as representing twice the number of unprocessed tasks that the core 110 actually has. In this way, the master unit 140 accounts for the difference in processing power between the cores by weighting the credit number of the first core 110, thereby better balancing the cores and reducing skew. Alternatively, the master unit 140 may initially assign the same number of tasks to each core based on the buffer size of the slave units, as mentioned above. While this initially means that the workload of each core is not proportional to its processing power, load balancing during image processing may compensate for this to reduce or eliminate skew.
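One way to realise the weighting described above can be sketched as follows (an illustrative assumption: dividing each credit number by the core's available PUs is equivalent, up to scale, to doubling the credit number of the core with half as many PUs):

```python
def least_busy_core(credits, available_pus):
    """Pick the least busy core, weighting each credit number by the
    core's number of available PUs: a core with more PUs can absorb
    more outstanding tasks for the same effective busyness."""
    return min(credits, key=lambda c: credits[c] / available_pus[c])


credits = {"core0": 2, "core1": 3}        # outstanding tasks per core
available_pus = {"core0": 2, "core1": 4}  # core1 has twice the PUs
# core1's 3 tasks over 4 PUs (0.75) make it less busy than core0's
# 2 tasks over 2 PUs (1.0), despite its larger raw credit number.
assert least_busy_core(credits, available_pus) == "core1"
```

The same function also captures the bias mentioned below: for equal credit numbers, the core with more available PUs 199 yields the smaller weighted value and is chosen first.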
Weighting is not the only way master unit 140 may account for different numbers of PUs 199 in the cores. For example, master unit 140 may be biased towards allocating work to cores with more available PUs 199, such that when two cores with different numbers of PUs 199 have the same credit number, master unit 140 preferentially assigns tasks to the core with the larger number of available PUs 199.
One complicating factor that may need to be considered when processing tasks in parallel is task dependency. Some tasks (referred to herein as dependent tasks) rely on the completion of an earlier task (referred to herein as a first task); that is, a "first task" is a task upon which another task depends. One example of a task dependency is a dependent task that requires the output of a first task as an input. If the dependent task is processed before the first task, it will not be processed correctly and the final image will contain errors. Typically, the set of tasks is provided to the master unit 140 in the order in which the application wishes them to be processed, such that a first task always precedes its dependent tasks. This is because image rendering applications may not be aware that they are running on a multi-core GPU, and thus provide a single control flow suitable for processing by a single core.
When splitting tasks between cores, a dependent task may be sent to a core for processing, and processed, before the task on which it depends has been processed. In order to maintain the integrity of the final image, this must be prevented. One solution is to ensure that a first task and its dependent tasks are always assigned to the same core in the required order, so that the core always processes the first task before the dependent task. However, this solution limits the extent to which tasks can be processed in parallel and may affect the performance of the GPU. This is especially the case when a dependent task depends on a plurality of first tasks. These first tasks would ideally be processed in parallel, greatly reducing overall processing time, but the above solution prohibits this, instead requiring all of the tasks to be processed on a single core.
A solution that maintains a higher degree of parallelization is explained with reference to fig. 6, and makes use of task completion update commands. A task completion update command is a command that, when processed by a core, causes the slave unit of that core to transmit a task completion update to the master unit 140. One example of a task completion update command is a work fence command. When a core processes a work fence command, the slave unit of that core transmits a fence update to the master unit that assigned the work fence command to the core.
In this example, a task completion update differs from a credit notification in that the credit notification only indicates that some task has been processed by the core, while the task completion update specifically indicates that the task completion update command has been processed. In some examples, credit notifications may also serve the purpose of task completion updates. For example, a credit notification may include an indication of which task has been completed. Alternatively, the master unit 140 may determine that a particular task has been processed after a particular number of credit notifications has been received from a core, based on the credit number of the core at the time the master unit assigned that task to the core. For example, if master unit 140 assigns a task to the second core 120 when the second core already has a credit number of nine, master unit 140 may determine that the task has been processed once it has received ten credit notifications from the second core.
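The inference just described, i.e. deducing that a specific task is done from anonymous credit notifications, might be modelled like this (an illustrative sketch; it assumes no further tasks are assigned while waiting, so that n+1 notifications after assigning at credit number n imply the task completed, matching the nine-credits/ten-notifications example):

```python
class CompletionTracker:
    """Track when specific tasks complete on one core, given only
    anonymous per-core credit notifications. A task assigned when the
    core's credit number is n is known complete after n + 1 further
    notifications (assuming no interleaved new assignments)."""

    def __init__(self):
        self.outstanding = 0   # current credit number for this core
        self.watch = {}        # task -> notifications still needed

    def assign(self, task):
        self.outstanding += 1
        self.watch[task] = self.outstanding

    def on_credit_notification(self):
        self.outstanding -= 1
        done = [t for t, n in self.watch.items() if n == 1]
        self.watch = {t: n - 1 for t, n in self.watch.items() if n > 1}
        return done            # tasks now known to be complete


tracker = CompletionTracker()
tracker.assign("t1")
tracker.assign("t2")   # assigned while credit number was already 1
assert tracker.on_credit_notification() == ["t1"]
assert tracker.on_credit_notification() == ["t2"]
```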
FIG. 6 is a flow chart depicting an exemplary method by which the GPU 100 of FIG. 1 may use task completion update commands. For simplicity, the initial steps of receiving 210 the set of tasks and assigning 220, 230 the subsets are not shown, but are performed as shown in FIG. 2. At step 600, after step 220 and before step 240, the master unit 140 includes a task completion update command after (preferably immediately after) the first task in the first subset. The first subset (including the task completion update command) is then transmitted 240 to the slave unit 111 of the first core 110. After the first core 110 has processed 610 the first task, it processes 620 the task completion update command. Upon processing 620 the command, the slave unit 111 transmits 630 a task completion update to the master unit 140. This update informs the master unit 140 that the first task has been processed by the first core 110. The master unit 140 can then assign the dependent task to any core without any risk of the dependent task being processed before the first task. For example, master unit 140 may assign 640 the dependent task to the second core 120 and transmit 650 it to that core. Preferably, the master unit 140 will assign dependent tasks to the least busy core, as indicated by the credit numbers explained above, in order to continue load balancing the cores.
This approach allows tasks to be processed in parallel without the risk of a dependent task being processed before the task on which it depends. When the first or second subset contains dependent tasks, the master unit 140 may refrain from transmitting those tasks to the cores until it receives a task completion update for the first task on which they depend. A dependent task may itself be the first task of another dependent task, and may therefore be followed by its own task completion update command.
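The fence mechanism of FIG. 6 can be sketched as a small command-stream model (illustrative only; the `Fence` class, the holding back of dependent tasks, and the queue structure are assumptions about one possible software analogue):

```python
from collections import deque


class Fence:
    """Task completion update command: when processed, the core's slave
    unit sends a task completion update naming this fence."""
    def __init__(self, fence_id):
        self.fence_id = fence_id


class Master:
    def __init__(self):
        self.blocked = {}  # fence_id -> dependent tasks held back

    def submit_first_task(self, core_queue, task, fence_id):
        # Insert the fence immediately after the first task (step 600).
        core_queue.extend([task, Fence(fence_id)])

    def hold_dependent(self, fence_id, task):
        # Refrain from transmitting until the fence update arrives.
        self.blocked.setdefault(fence_id, []).append(task)

    def on_fence_update(self, fence_id):
        # Fence reached: safe to assign the dependents to ANY core.
        return self.blocked.pop(fence_id, [])


def run_core(core_queue, master):
    released = []
    while core_queue:
        cmd = core_queue.popleft()
        if isinstance(cmd, Fence):
            # Slave unit transmits the task completion update (step 630).
            released += master.on_fence_update(cmd.fence_id)
        # else: process the image rendering task
    return released


master = Master()
core0 = deque()
master.submit_first_task(core0, "first_task", fence_id=7)
master.hold_dependent(7, "dependent_task")
assert run_core(core0, master) == ["dependent_task"]
```

Because the dependent task is only released once the fence has been processed, it can safely be transmitted to whichever core is least busy.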
In some examples, as a core processes a task, it stores the resulting data (the output of the task) in a local memory (e.g., a cache) that can only be accessed by that core (and may be located within the core). The core may periodically write this data to a shared memory accessible to all cores. However, this may lead to another dependency problem: even where the first task on which a dependent task depends has been fully processed, the output data may be stored in a local memory that is inaccessible to the core processing the dependent task. If this occurs, at least a portion of the input data for the dependent task is unavailable, and the dependent task will not be processed correctly. For example, if the first core 110 has not yet written the output of the first task to the shared memory accessible to all cores, the second core 120 will not be able to properly process the dependent task even though the first task has been processed. To solve this problem, a memory refresh command may be used in addition to the task completion update command. When processed by a core, the memory refresh command causes the core to write all data stored in its local memory to a shared memory accessible to all cores. An exemplary GPU 700 including a shared memory 710 is depicted in fig. 7. An exemplary method utilizing a memory refresh command is explained below with reference to fig. 8. As with FIG. 6, steps 210-230 are not shown in the flowchart but are present in the method. In addition, steps 640 and 650 are not shown (but still exist in the method).
In addition to including 600 a task completion update command after the first task, master unit 140 may also include 800 a memory refresh command after the first task (and preferably before the task completion update command). When the first core 110 processes 810 the memory refresh command (after processing 610 the first task), it writes 820 all output data stored in a local memory (not shown) of the first core 110 to the shared memory 710. By writing the output data to the shared memory 710, the output data is accessible to all other cores, any of which may then process dependent tasks.
It is preferable to include the memory refresh command before the task completion update command, because the task completion update transmitted by the slave unit 111 then informs the master unit 140 both that the first task has been processed and that the output data of the first task is available in the shared memory 710.
Another useful type of command that may be transmitted with a subset is a cache flush invalidate (CFI) command. A CFI command may be broadcast to all cores in the GPU; more specifically, the master unit may send the CFI command to all cores to which it has assigned work. Similar to the memory refresh command, the CFI command causes any core that processes it to write all data stored in the core to the shared memory 710. Generally, CFI commands are used when a set of tasks received by the master unit 140 has been fully processed. In other words, master unit 140 may broadcast a CFI command when no further tasks from the set of tasks remain to be assigned to the cores. This prepares the cores to receive tasks from a new set of tasks. The CFI command is useful because it avoids an external process (e.g., GPU firmware or software running on an external host) having to instruct the cores to flush their memories, which is slow and increases the idle time between the GPU completing one workload and being sent another, thereby degrading performance. After a core performs the CFI, the slave unit of that core may transmit a CFI notification to the master unit informing it that the CFI is complete. In some examples, the slave units 111, 121, 131 may be configured to perform a CFI and send a CFI notification automatically. For example, a slave unit may be configured to perform a CFI when its core has no further tasks to process.
So far, multi-core systems have been described as including only a single master unit. However, this need not be the case. In some examples, each core may include a master unit in addition to a slave unit, or a plurality of master units and slave units. In any of the examples provided above, each core may include a master unit and a slave unit, but only one master unit may be active.
As already mentioned above, the image rendering tasks may comprise a plurality of different types of task, such as fragment, geometry and compute tasks, and for each type of task the GPU may comprise dedicated hardware for performing that particular type of task. Typically, a set of tasks provided to the GPU will include only one of these types of task. The management of these tasks may be separated such that one master unit and one group of slave units interact with only one type of task, at least at any given time. Thus, processing two types of task in parallel and load balancing the cores for each type may require at least two active master units in the multi-core system and at least two slave units per core. A master unit configured to receive, assign, and transmit only geometry tasks may be referred to as a geometry master unit, and its slave units as geometry slave units. Master and slave units configured in the same manner, but for fragment processing tasks, may be referred to as fragment master units and fragment slave units. FIG. 9 depicts one example of a GPU 900 that includes geometry and fragment master and slave units.
In GPU 900, the third core 930 includes a fragment master unit 941 and a fragment slave unit 931, and a geometry master unit 942 and a geometry slave unit 932. The first core 910 and the second core 920 each include a fragment slave unit 911, 921 and a geometry slave unit 912, 922. In some examples, the first core 910 and the second core 920 may also each include a fragment master unit and a geometry master unit, such that the three cores are identical; however, for simplicity we will consider only the example in which the third core 930 includes the master units.
As explained above, the fragment master unit 941 is configured to receive fragment processing tasks, while the geometry master unit 942 is configured to receive geometry processing tasks. The multi-core GPU 900 may perform any of the methods described above. The fragment master and slave units may perform any of the methods described above simultaneously with, but independently of, the geometry master and slave units. For example, fragment master unit 941 may maintain credit numbers for the cores while geometry master unit 942 also maintains credit numbers for the cores, independently of the fragment master unit. More specifically, the fragment master unit 941 may maintain a credit number for each fragment slave unit 911, 921, 931 of the cores to which it assigns work, and the geometry master unit 942 may maintain a credit number for each geometry slave unit 912, 922, 932 of the cores to which it assigns work. When the fragment master unit 941 assigns a fragment processing task to a core, it may adjust the credit number of that core by a first amount, as described with reference to figs. 4 and/or 5. However, geometry master unit 942 will only adjust the credit number of a core in response to that core being assigned a geometry processing task, or in response to the core notifying it that one of its geometry tasks is complete. The same is true of the fragment master unit 941, which adjusts its credit numbers only in response to fragment tasks. In this way, two different credit numbers may be maintained for each core: one relating only to how busy the core is with fragment processing tasks, and the other relating only to how busy the core is with geometry processing tasks. By independently load balancing the cores as described above, both master units 941, 942 help reduce skew between the cores. It is particularly preferred that both/all active master units perform load balancing, as this enables compensation for contention within a core.
For example, if the first core 910 is assigned a set of high priority fragment processing tasks, this may delay the processing of any geometry processing tasks assigned to the first core 910, and the geometry credit number of the first core 910 will remain high. This means that the geometry master unit 942 does not assign additional geometry processing tasks to the first core 910, which would only add to a backlog of tasks and cause skew. Although the master and slave units have been described above as geometry units and fragment units, they may instead be configured to handle other types of task. Most generally, they can be described simply as first and second master/slave units configured to handle first and second types of task.
In some examples, multiple active master units may cooperate for load balancing. The master may maintain a shared credit number for each core that represents the total number of all types of tasks currently assigned to that core. Skew between cores may again be prevented using the same load balancing principle that assigns additional tasks to the least busy cores. In some examples, a single master unit may receive a heterogeneous set (a set containing a combination of task types) of tasks and may split the tasks between cores, maintaining a single credit number for each core as described above.
Although in the example of fig. 9 the third core 930 includes the two active master units, this need not be the case. For example, the first core 910 may include one active master unit while the third core includes the other. In some examples, the first core may include all of the active master units. In some examples, the third core 930 may be absent, and the GPU may include only two cores. Preferably, each core includes the same number of master units and slave units, and each core is identical, in that each master unit and each slave unit in a core has a counterpart in each of the other cores. This also applies where there is only one active master unit in the multi-core system, as in the examples of figs. 1 and 3. By making the cores identical, even though this may create redundancy due to inactive master units, the cores are able to operate in parallel and independently as described above, with at least one master unit in each core being active and providing work to the corresponding slave units of that core. Furthermore, using identical cores that each include master and slave units makes the GPU architecture easier to scale; that is, it becomes simpler to design a larger (or smaller) GPU by adding (or removing) cores without disrupting the overall layout, and designs with different numbers of cores are easier to verify because there are fewer distinct units overall. Because of the relatively small on-chip size of a master unit, including inactive master units does not use up a significant amount of chip space, and it also provides redundancy if a master unit in another core fails.
In some examples, a register bus 101 links the cores. The main function of the register bus 101 is to transfer register information between cores: configuration registers are set using register write commands, and register information is accessed using register read commands. However, the register bus 101 may also be used for communication between a master unit and its slave units, e.g., allowing the master unit 140 to transmit tasks to the slave units 111, 121, and allowing the slave units 111, 121 to transmit credit notifications and task completion updates to the master unit 140. Using the register bus 101 in this manner eliminates the need for dedicated connections between the cores, thereby saving chip space. Master-slave communications can be carried over the register bus 101 because these communications are small. For example, when transmitting a subset of tasks to a core, rather than encoding all of the information needed to process the tasks in register write commands, the master unit 140 may instead provide the slave unit with just enough information to locate the necessary information, for example an address in memory of the data to be processed.
To carry master-slave communication over the register bus 101, transmitting 240, 250 the first and second subsets may include the master unit outputting (step 1000) first and second register write commands, wherein the first register write command includes an indication of the first subset of tasks and the second register write command includes an indication of the second subset of tasks (see fig. 10). The master unit may also output subsequent register write commands including indications of subsequently assigned tasks. Outputting 1000 may include transmitting a register write command directly to a slave unit via the register bus 101, or outputting the command to other hardware units in the third core 130 for transmission.
The master unit 140 may address each register write command to the core to which it has assigned the subset of tasks indicated in that command. In other words, the master unit 140 may output a first register write command addressed to the first core and including an indication of the first subset of tasks, and a second register write command addressed to the second core and including an indication of the second subset of tasks. Depending on the number of tasks in each subset, more than one register write command may be required to transfer a subset. In some cases, such as when transmitting a subsequent task, a register write command may include an indication of only a single task. In any case, each register write command is transmitted over the register bus 101 and received by the core to which it is addressed. Rather than writing the data in the command to registers, as would normally occur, the cores 110, 120 pass the data to the slave units 111, 121. When each core includes a plurality of slave units, a register write command may be addressed to a particular slave unit in a particular core, and the receiving core may pass the data contained in the command to the slave unit to which it is addressed. In this way, the separation of the first and second types of task between the first and second master and slave units (as described above) may be maintained.
When the slave units of the cores are configured to transmit credit notifications, task completion updates, and/or CFI notifications, these communications may take the form of register write commands or register read commands addressed to the active master unit. When there are multiple active master units, each slave unit may transmit its communications to the master unit from which it receives tasks. Similarly, a master unit may address its communications to a particular slave unit within a particular core.
When using the register bus 101 to transfer master-slave communications, it may be useful to reserve a range of register addresses for master-slave communications. A reserved register address is a register address that the cores are configured not to use for conventional register read/write commands (commands intended to actually access the registers). Thus, when a core receives a register read/write command addressed to a reserved register address, rather than simply reading/writing data from/to a register, it can determine that the command is a master-slave communication and pass the data to the master or slave unit of the core. Each core may be associated with at least one reserved register address, such that the address indicates both that the communication is a master-slave communication, rather than a normal register read or write command, and to which core (and optionally which slave unit in the core) the communication is addressed. If a register read/write command does not use a reserved register address, the core may treat it as a regular register read/write command and read/write data from/to the register. Communications sent by the slave units, such as credit notifications and task completion updates, may also be addressed to reserved register addresses, and the third core 130 may pass these communications to the master unit 140 only when they are addressed to a reserved register address. By the slave units addressing credit notifications and/or task completion updates to different reserved register addresses, the master unit 140 can determine which core (and which slave unit in that core) sent each credit notification and/or task completion update.
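The reserved-address decode described above can be sketched as follows. The address range, the base address, and the one-address-per-(core, slave) mapping are assumptions chosen for illustration; a real register map would differ.

```python
# Assumed reserved range: addresses at or above RESERVED_BASE are never
# used for ordinary register accesses.
RESERVED_BASE = 0xF000
SLAVES_PER_CORE = 2  # assumed number of slave units per core

def reserved_address(core_id: int, slave_id: int) -> int:
    """Reserved address associated with one (core, slave unit) pair."""
    return RESERVED_BASE + core_id * SLAVES_PER_CORE + slave_id

def route_register_write(address: int, data: bytes, registers: dict) -> str:
    """Decide whether a write is a normal register access or a
    master-slave communication to be passed to a slave unit."""
    if address >= RESERVED_BASE:
        # Master-slave communication: decode which slave it is for.
        core_id, slave_id = divmod(address - RESERVED_BASE, SLAVES_PER_CORE)
        return f"forward to slave {slave_id} of core {core_id}"
    registers[address] = data  # ordinary register write
    return "wrote register"

registers = {}
normal = route_register_write(0x0040, b"\x01", registers)
ms_comm = route_register_write(reserved_address(1, 0), b"\x02", registers)
```

Here the unreserved write lands in the register file, while the write to the reserved address is diverted without touching any register, mirroring the behaviour described above.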
In some examples, the core that includes the active master (or each core that includes the active master) may include an arbitration unit. An example of this is depicted in fig. 11, where the third core 130 of the GPU 1100 includes an active master unit 140 and an arbitration unit 1130 in communication with the master unit 140 and the slave unit 131. An exemplary method performed by the GPU 1100 is depicted in fig. 12. The arbitration unit 1130 may receive 1200 a register write command output 1000 by the master unit 140 and send 1210 the write command (or data contained therein) to the slave unit 131 of the core containing the master unit if the write command is addressed to the slave unit 131 of the core containing the master unit (in this example, the third core 130). If the register write command is not addressed to a slave unit in the core that includes the master unit, then arbitration unit 1130 may forward 1220 the write command to register bus 101 for transmission to the core to which the command is addressed. This may mean forwarding the command directly to the register bus 101 or indirectly to the register bus 101 via another unit in the core that includes the master unit.
In the case of a core comprising a plurality of master units and/or a plurality of slave units, its arbitration unit may communicate with each of these master units and slave units. For example, if the third core 930 of the GPU 900 of fig. 9 includes an arbitration unit, it may receive register write commands from both master units 941, 942 and send any commands (or data contained therein) to the appropriate slave units within the core 930 or forward the commands to the register bus 101 when they are not addressed to any of the slave units of the third core 930.
Similarly, the first core 110 and the second core 120 (and more generally, the cores that do not include an active master unit) may each include an arbitration unit that receives the register write commands sent by the core having the active master unit. In each core, the arbitration unit may send a received register write command (or the data contained therein) to the slave unit to which the write command is addressed. Where the slave unit is configured to transmit a credit notification, a task completion update, or a CFI notification, this communication (optionally in the form of a register read/write command) may be output to the arbitration unit of the core. The arbitration unit may forward the command to the register bus 101 for transmission to the relevant active master unit, or, if the relevant active master unit is on the same core as the arbitration unit, may send the read/write command directly to that master unit. When the arbitration unit forwards the read/write command to the register bus 101, it may forward the command directly to the register bus 101 or indirectly via another unit in the core. Alternatively, the slave units may transmit communications directly over the register bus 101.
The arbitration unit 1130 of the core including the active master unit may be configured to receive the register read/write commands transmitted by the slave units of the core and to send the commands to the active master unit.
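The arbitration behaviour described in the last few paragraphs can be sketched as below, under the assumption that each command records the core it is addressed to. Commands from the local master unit are delivered to the local slave unit when addressed to this core (step 1210) and otherwise forwarded towards the register bus (step 1220); commands from the local slave unit are passed to the local master unit when that master is the active one, and otherwise forwarded to the bus. All class names, fields, and ids are illustrative assumptions, not part of the patent.

```python
class ArbitrationUnit:
    def __init__(self, core_id: int, has_active_master: bool):
        self.core_id = core_id
        self.has_active_master = has_active_master
        self.to_slave = []         # delivered to this core's slave unit
        self.to_master = []        # delivered to this core's master unit
        self.to_register_bus = []  # forwarded for transmission to other cores

    def receive_from_master(self, command: dict) -> None:
        if command["dest_core"] == self.core_id:
            self.to_slave.append(command)         # step 1210: local delivery
        else:
            self.to_register_bus.append(command)  # step 1220: forward to bus

    def receive_from_slave(self, command: dict) -> None:
        if self.has_active_master:
            self.to_master.append(command)        # e.g. a credit notification
        else:
            self.to_register_bus.append(command)  # route towards remote master

# The arbitration unit of the core holding the active master (core id 2
# here) keeps one command local and forwards the other to the bus.
arb = ArbitrationUnit(core_id=2, has_active_master=True)
arb.receive_from_master({"dest_core": 0, "tasks": [1, 2]})
arb.receive_from_master({"dest_core": 2, "tasks": [3]})
arb.receive_from_slave({"type": "credit_notification"})
```

The slave-to-master branch also mirrors the case noted below, where a communication between a slave and master on the same core never reaches the interface unit or the register bus.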
As depicted in fig. 13, first core 110, second core 120, and third core 130 of GPU 1300 may each include interface units 1310, 1320, 1330. Each interface unit is connected to the register bus 101 and can communicate with the master and slave units of its core. The interface unit is configured to receive register read/write commands from the hardware units within the core and to transmit these commands over the register bus 101. Similarly, they are also configured to receive register read/write commands from the register bus 101 and forward them to the master or slave within the core. An exemplary method performed by GPU 1300 will be explained with reference to fig. 14A.
The initial stages (210-230) of the method are the same as those explained with reference to fig. 2. The interface unit 1330 of the third core 130 may receive 1400 the register write commands output 1000 by the master unit 140 and transmit 1410 the first command to the first core 110 and the second command to the second core 120. The interface unit 1310 of the first core 110 receives 1420 the first register write command, and the interface unit 1320 of the second core 120 receives 1430 the second register write command. The interface unit 1310 of the first core 110 forwards 1440 the first register write command to the slave unit 111, and the interface unit 1320 of the second core 120 forwards 1450 the second register write command to the slave unit 121. Forwarding to a slave unit may mean sending the command directly to the slave unit, or indirectly via another hardware unit, e.g. via an arbitration unit.
The interface units 1310, 1320 may each, upon receiving a register write command, determine 1425, 1435 whether the register write command is addressed to a reserved register address or to an unreserved register address (see fig. 14B). If the register write command is addressed to the reserved register address, the interface unit may recognize that this is a master-slave communication and forward 1440, 1450 the register write command to the slave of the core. Otherwise, the interface unit will treat the register write command as a regular command to write data to registers in the core.
In some examples, the first core 110, the second core 120, and the third core 130 (and more generally, all cores of the GPU) each include both an interface unit and an arbitration unit. An example of this is shown in fig. 15. In fig. 15, each core 110, 120, 130 of the GPU 1500 includes: a master unit 140, 141, 142; a slave unit 111, 121, 131; an arbitration unit 1110, 1120, 1130; and an interface unit 1310, 1320, 1330. Each interface unit 1310, 1320, 1330 is connected to the register bus 101 and to the arbitration unit 1110, 1120, 1130 of its core. Each arbitration unit 1110, 1120, 1130 communicates with the slave unit 111, 121, 131 of its core and the master unit 140, 141, 142 of its core. As explained above, because each core includes only a single slave unit, only one master unit in the GPU 1500 is active. In this example, it is the master unit 140 of the third core 130. The inactivity of the master units 141, 142 of the first core 110 and the second core 120 is represented by diagonal shading.
Fig. 16 is a flowchart depicting an exemplary method performed by GPU 1500 of fig. 15. For the purposes of this example, it is assumed that the method steps prior to transmitting 240, 250 the first subset and the second subset are the same as in fig. 2.
In step 1000, the master unit 140 outputs the first register write command and the second register write command. In step 1200, the arbitration unit 1130 receives the first register write command and the second register write command. Since neither command is addressed to a slave unit within the third core 130, the arbitration unit 1130 forwards 1220 the first register write command and the second register write command, sending them to the interface unit 1330 of the third core 130. The interface unit 1330 transmits 1410 the first register write command to the first core 110 and the second register write command to the second core 120 over the register bus 101. The interface unit 1310 of the first core 110 receives 1420 the first register write command and, upon determining 1425 that it is addressed to a reserved register address, forwards 1440 it to the arbitration unit 1110. The arbitration unit 1110 forwards 1600 the first register write command (or the data contained therein) to the slave unit 111 of the first core 110 (to which it is addressed). Similarly, the interface unit 1320 of the second core 120 receives 1430 the second register write command and, upon determining 1435 that it is addressed to a reserved register address, forwards 1450 it to the arbitration unit 1120. The arbitration unit 1120 then forwards 1610 the second register write command (or the data contained therein) to the slave unit 121 of the second core 120.
Any communications sent by the slave units 111, 121 of the first core 110 or the second core 120 may be sent to the master unit 140 in a similar manner. By way of example, the communications sent by the slave units may include credit notifications, task completion updates, and CFI notifications. The communication may be in the form of a register read or write command and may be addressed to a reserved register address associated with master unit 140. For example, the slave unit 111 of the first core 110 may output a credit notification when the first core 110 completes a task. The arbitration unit 1110 may receive this credit notification and, upon determining that it is not addressed to the master unit 141 of the first core 110, may forward the credit notification to the interface unit 1310. Interface unit 1310 may transmit a credit notification to master unit 140. Interface unit 1330 of third core 130 may receive the credit notification and may forward it to arbitration unit 1130 when it is determined that it is addressed to the reserved register address. Arbitration unit 1130 may then forward the credit notification to master unit 140.
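The credit bookkeeping that underlies these notifications (described earlier and recited in the claims) can be sketched as follows. Here, assigning a task adjusts a core's credit count by a negative amount and a credit notification adjusts it by a positive amount, which is one of the two sign conventions the scheme permits; the class and method names are illustrative assumptions.

```python
class CreditTracker:
    """Per-core credit numbers maintained by an active master unit."""

    def __init__(self, core_ids, initial_credits: int):
        self.credits = {core: initial_credits for core in core_ids}

    def assign_task(self, core_id) -> None:
        # First amount (negative here): one credit consumed per assigned task.
        self.credits[core_id] -= 1

    def credit_notification(self, core_id) -> None:
        # Second amount (positive here): a processed task returns a credit.
        self.credits[core_id] += 1

    def least_loaded_core(self):
        # The core with the most remaining credits has the least
        # outstanding work, so subsequent tasks go there.
        return max(self.credits, key=self.credits.get)

tracker = CreditTracker(["core0", "core1"], initial_credits=4)
tracker.assign_task("core0")
tracker.assign_task("core0")
tracker.assign_task("core1")
first_pick = tracker.least_loaded_core()   # core1: 3 credits vs core0: 2
tracker.credit_notification("core0")
tracker.credit_notification("core0")
second_pick = tracker.least_loaded_core()  # core0 back to 4 credits
```

This matches the behaviour where a subsequent task is assigned to the core currently assigned the least work, based on the stored credit numbers.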
When the slave 131 of the third core 130 transmits a communication addressed to the master 140 of the same core, this communication may be routed by the arbitration unit 1130 to the master 140 without being forwarded to the interface unit 1330 or the register bus 101.
It should be appreciated that, just as the methods of figs. 12 and 14A can be combined to produce the method depicted in fig. 16, any combination of the other methods disclosed above is also possible. For example, fig. 17 depicts a composite method including the steps of figs. 5, 8, and 16, and this composite method may be performed on the GPU of fig. 15, or on a similar GPU lacking the master units 141 and 142.
It should be noted that some method steps are depicted as overlapping; this is indicated by a diagonal line between the reference numerals. For example, steps 500 and 640 overlap. This means that the subsequent task assigned to the core with the least work (step 500) may also be the dependent task mentioned in step 640.
In some examples, the subsequent task assigned in step 500 is a first task. In this case, the master unit 140 may insert a task completion update command and a memory refresh command, as described above with respect to the first task. This looping back through the method is depicted by the dashed arrows.
It should be appreciated that transmission 520/650 may include all of the same steps as transmission 240 or 250.
The method depicted in fig. 17 may be performed by the GPU, following the various loops shown, until there are no subsequent tasks to assign and transmit.
In any of the examples above, each core of the GPU may be identical. This means that each core may comprise the same number of master units and slave units, as well as arbitration units and interface units. Further, each master unit may be identical, each slave unit may be identical, each arbitration unit may be identical, and each interface unit may be identical.
Most of the examples described above have referred to GPUs that include at least three cores, with the third core including an active master unit. However, it should be understood that the features described in these examples may be generalized to other GPUs having two or more cores, and where one of the first core and the second core includes an active master unit.
FIG. 18 illustrates a computer system in which a graphics processing system described herein may be implemented. The computer system includes a CPU 1902, a GPU 1904, a memory 1906, and other devices 1914, such as a display 1916, speakers 1918, and a camera 1919. Processing block 1910 (corresponding to cores 110, 120, 130 and register bus 101) is implemented on GPU 1904. In other examples, processing block 1910 may be implemented on CPU 1902. The components of the computer system may communicate with each other via a communication bus 1920.
The GPUs of figs. 1, 3, 7, 9, 11, 13, and 15 are shown as comprising a number of functional blocks. This is schematic only and is not intended to define a strict division between different logic elements of such entities. Each functional block may be provided in any suitable manner. It should be understood that intermediate values described herein as being formed by a GPU need not be physically generated by the GPU at any point, and may merely represent logical values that conveniently describe the processing performed by the GPU between its inputs and outputs.
The GPUs described herein may be embodied in hardware on an integrated circuit. The GPUs described herein may be configured to perform any of the methods described herein. In general, any of the functions, methods, techniques, or components described above may be implemented in software, firmware, hardware (e.g., fixed logic circuitry), or any combination thereof. The terms "module," "functionality," "component," "element," "unit," "block," and "logic" may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block, or logic represents program code that performs specified tasks when executed on a processor. The algorithms and methods described herein may be executed by one or more processors executing code that causes the processor(s) to perform the algorithms/methods. The code may be stored on a computer-readable storage medium. Examples of a computer-readable storage medium include random-access memory (RAM), read-only memory (ROM), optical discs, flash memory, hard disk memory, and other memory devices that can store instructions or other data using magnetic, optical, and other techniques and that can be accessed by a machine.
The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for execution by a processor, including code expressed in machine language, an interpreted language, or a scripting language. Executable code includes binary code, machine code, byte code, code defining an integrated circuit (such as a hardware description language or netlist), and code expressed in programming language code such as C, java or OpenCL. The executable code may be, for example, any kind of software, firmware, script, module, or library that, when properly executed, handled, interpreted, compiled, run in a virtual machine or other software environment, causes the processor of the computer system supporting the executable code to perform the tasks specified by the code.
The processor, computer, or computer system may be any kind of device, machine, or special purpose circuit, or a set or portion thereof, that has processing capability such that it can execute instructions. A processor may be any kind of general purpose or special purpose processor, such as a CPU, a GPU, an NNA, a system on a chip, a state machine, a media processor, an application-specific integrated circuit (ASIC), a programmable logic array, a field-programmable gate array (FPGA), or the like. A computer or computer system may include one or more processors.
The present invention is also intended to cover software defining the configuration of hardware as described herein, such as Hardware Description Language (HDL) software, for designing integrated circuits or for configuring programmable chips to achieve desired functions. That is, a computer readable storage medium may be provided having encoded thereon computer readable program code in the form of an integrated circuit definition data set that, when processed (i.e., run) in an integrated circuit manufacturing system, configures the system to manufacture a GPU or graphics processing system configured to perform any of the methods described herein, or to manufacture a GPU or graphics processing system comprising any of the devices described herein. The integrated circuit definition data set may be, for example, an integrated circuit description.
Accordingly, a method of manufacturing a GPU or graphics processing system as described herein at an integrated circuit manufacturing system may be provided. Furthermore, an integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, causes a method of manufacturing a GPU or a graphics processing system to be performed.
The integrated circuit definition dataset may be in the form of computer code, for example as a netlist, as code for configuring a programmable chip, or as a hardware description language defining hardware suitable for manufacture in an integrated circuit at any level, including as register transfer level (RTL) code, as a high-level circuit representation (e.g., Verilog or VHDL), and as a low-level circuit representation (e.g., OASIS (RTM) and GDSII). A higher-level representation (e.g., RTL) that logically defines hardware suitable for manufacture in an integrated circuit may be processed at a computer system configured to generate a manufacturing definition of the integrated circuit, in the context of a software environment comprising definitions of circuit elements and rules for combining those elements in order to generate the manufacturing definition of the integrated circuit so defined by the representation. As is typically the case when software executes at a computer system to define a machine, one or more intermediate user steps (e.g., providing commands, variables, etc.) may be required in order for the computer system to execute the code defining the integrated circuit and so generate the manufacturing definition of that integrated circuit.
An example of processing an integrated circuit definition data set at an integrated circuit manufacturing system to configure the system to manufacture a GPU or a graphics processing system will now be described with reference to fig. 19.
Fig. 19 illustrates an example of an Integrated Circuit (IC) manufacturing system 2002 configured to manufacture a GPU or graphics processing system as described in any of the examples herein. In particular, the IC fabrication system 2002 includes a layout processing system 2004 and an integrated circuit generation system 2006. The IC manufacturing system 2002 is configured to receive an IC definition data set (e.g., defining a GPU or graphics processing system as described in any of the examples herein), process the IC definition data set, and generate an IC (e.g., that contains a GPU or graphics processing system as described in any of the examples herein) from the IC definition data set. Processing of the IC definition data set configures the IC fabrication system 2002 to fabricate an integrated circuit that includes a GPU or graphics processing system as described in any of the examples herein.
The layout processing system 2004 is configured to receive and process the IC definition dataset to determine a circuit layout. Methods of determining a circuit layout from an IC definition dataset are known in the art, and may involve, for example, synthesizing RTL code to determine a gate-level representation of the circuit to be generated, e.g. in terms of logical components (e.g., NAND, NOR, AND, OR, MUX and FLIP-FLOP components). The circuit layout may be determined from the gate-level representation of the circuit by determining positional information for the logical components. This may be done automatically or with user involvement in order to optimize the circuit layout. When the layout processing system 2004 has determined a circuit layout, it may output the circuit layout definition to the IC generation system 2006. A circuit layout definition may be, for example, a circuit layout description.
As is known in the art, the IC generation system 2006 generates ICs according to a circuit layout definition. For example, the IC generation system 2006 may implement a semiconductor device fabrication process to generate ICs, which may involve a multi-step sequence of photolithography and chemical processing steps during which electronic circuits are gradually formed on a wafer made of semiconductor material. The circuit layout definition may be in the form of a mask that may be used in a lithographic process to generate an IC from the circuit definition. Alternatively, the circuit layout definitions provided to the IC generation system 2006 may be in the form of computer readable code that the IC generation system 2006 can use to form a suitable mask for generating the IC.
The different processes performed by the IC manufacturing system 2002 may all be implemented in one location, e.g., by one party. Alternatively, the IC manufacturing system 2002 may be a distributed system, such that some of the processes may be performed at different locations and by different parties. For example, some of the following stages may be performed at different locations and/or by different parties: (i) synthesizing RTL code representing the IC definition dataset to form a gate-level representation of the circuit to be generated; (ii) generating a circuit layout based on the gate-level representation; (iii) forming a mask in accordance with the circuit layout; and (iv) fabricating the integrated circuit using the mask.
In other examples, processing of the integrated circuit definition data set at the integrated circuit manufacturing system may configure the system to manufacture a GPU or a graphics processing system without processing the IC definition data set to determine a circuit layout. For example, an integrated circuit definition dataset may define a configuration of a reconfigurable processor, such as an FPGA, and processing of the dataset may configure the IC manufacturing system to generate (e.g., by loading configuration data into the FPGA) the reconfigurable processor having the defined configuration.
In some embodiments, the integrated circuit manufacturing definition data set, when processed in the integrated circuit manufacturing system, may cause the integrated circuit manufacturing system to produce a device as described herein. For example, configuring an integrated circuit manufacturing system in the manner described above with reference to fig. 19, via an integrated circuit manufacturing definition dataset, may result in the manufacture of an apparatus as described herein.
In some examples, the integrated circuit definition dataset may contain software running on or in combination with hardware defined at the dataset. In the example shown in fig. 19, the IC generation system may also be configured by the integrated circuit definition data set to load firmware onto the integrated circuit in accordance with program code defined at the integrated circuit definition data set at the time of manufacturing the integrated circuit or to otherwise provide the integrated circuit with program code for use with the integrated circuit.
Implementation of the concepts set forth in the present application in apparatuses, devices, modules, and/or systems (and in methods implemented herein) may result in performance improvements over known embodiments. Performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During the manufacture of such devices, apparatus, modules and systems (e.g., in integrated circuits), a tradeoff may be made between performance improvement and physical implementation, thereby improving the manufacturing method. For example, a tradeoff may be made between performance improvement and layout area, matching the performance of known embodiments, but using less silicon. This may be accomplished, for example, by reusing the functional blocks in a serial fashion or sharing the functional blocks among elements of an apparatus, device, module, and/or system. In contrast, the concepts described herein that lead to improvements in the physical implementation of devices, apparatus, modules and systems (e.g., reduced silicon area) may be weighed against performance improvements. This may be accomplished, for example, by fabricating multiple instances of the module within a predefined area budget.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the application.
Claims (15)
1. A graphics processing unit (100) comprising a plurality of cores, wherein each core of the plurality of cores comprises a slave unit (111, 121, 131) configured to manage execution of image rendering tasks within the core, and wherein at least one core of the plurality of cores further comprises a master unit (140) configured to:
receiving a set of image rendering tasks;
assigning a first subset of image rendering tasks to a first core (110) of the plurality of cores;
assigning a second subset of image rendering tasks to a second core (120) of the plurality of cores;
transmitting the first subset of image rendering tasks to a slave unit (111) of the first core (110); and
transmitting the second subset of image rendering tasks to a slave unit (121) of the second core (120),
wherein:
the slave unit (111) of the first core (110) is configured to transmit a first credit notification to the master unit (140) when a task of the first subset of image rendering tasks has been processed;
the slave unit (121) of the second core (120) is configured to transmit a second credit notification to the master unit (140) when a task of the second subset of image rendering tasks has been processed; and
the master unit (140) is configured to:
storing a credit number for each of the first and second cores (110, 120);
when the master unit (140) assigns the first subset of image rendering tasks to the first core (110), for each of the first subset of image rendering tasks, adjusting the credit number of the first core (110) by a first amount;
when the master unit (140) assigns the second subset of image rendering tasks to the second core (120), for each of the second subset of image rendering tasks, adjusting the credit number of the second core (120) by the first amount;
adjusting the credit number of the first core (110) by a second amount when the master unit (140) receives the first credit notification; and
when the master unit (140) receives the second credit notification, the credit number of the second core (120) is adjusted by the second amount, wherein one of the first amount and the second amount is positive and the other is negative.
2. The graphics processing unit (100) according to claim 1, wherein the master unit (140) is configured to:
assigning a subsequent image rendering task to a slave unit of the core currently assigned least work based on the credit number of each of the cores;
adjusting the credit number of the core to which the subsequent image rendering task is assigned by the first amount; and
the subsequent image rendering task is transmitted to the slave unit of the core to which it has been assigned.
3. The graphics processing unit (100) of any of the preceding claims, wherein each of the plurality of cores comprises a second slave unit configured to manage execution of image rendering tasks of a second type by the cores, and wherein one of the cores comprises a second master unit configured to:
receiving a second set of image rendering tasks of the second type;
assigning a first subset of the second set of image rendering tasks to a first core of the plurality of cores;
assigning a second subset of the second set of image rendering tasks to a second, different core of the plurality of cores;
transmitting a first subset of the second set of image rendering tasks to a second slave unit of a first core of the plurality of cores (110, 120, 130); and
a second subset of the second set of image rendering tasks is transmitted to a second slave unit of a second core of the plurality of cores.
4. The graphics processing unit (100) according to any one of the preceding claims, wherein:
The master unit (140) is configured to output a first register write command and a second register write command;
the first register write command is addressed to the first core (110) and includes an indication of the first subset of image rendering tasks, and the second register write command is addressed to the second core (120) and includes an indication of the second subset of image rendering tasks; and
The plurality of cores are connected by a register bus (101) configured to transfer register write commands between the cores.
5. The graphics processing unit (1100) of claim 4, wherein at least the core comprising the master unit (140) further comprises an arbitration unit (1130) in communication with the master unit (140) and the slave unit (131) of the core, wherein the arbitration unit (1130) is configured to:
receiving the register write commands from the master unit (140); and
for each register write command:
if the register write command is addressed to a core comprising the master unit, passing the register write command to a slave unit (131) comprising the core of the master unit; and
if the register write command is not addressed to a core comprising the master unit, the register write command is forwarded for transmission over the register bus (101).
6. The graphics processing unit (1300) of claim 4 or 5, wherein each of the plurality of cores includes an interface unit (1310, 1320, 1330) in communication with the register bus (101), wherein the interface unit (1330) of the core comprising the master unit is configured to:
receive the first register write command and the second register write command; and
transmit the first register write command to the first core (110) and the second register write command to the second core (120) over the register bus (101),
wherein the interface unit (1310) of the first core (110) is configured to:
receive the first register write command via the register bus (101); and
forward the first register write command to the slave unit (111) of the first core (110),
and wherein the interface unit (1320) of the second core (120) is configured to:
receive the second register write command via the register bus (101); and
forward the second register write command to the slave unit (121) of the second core (120).
7. The graphics processing unit (1300) of claim 6, wherein the interface unit (1310) of the first core (110) is configured to:
determine whether the first register write command is addressed to a first reserved register address; and
if the first register write command is addressed to the first reserved register address, forward the first register write command to the slave unit (111) of the first core (110),
and wherein the interface unit (1320) of the second core (120) is configured to:
determine whether the second register write command is addressed to a second reserved register address; and
if the second register write command is addressed to the second reserved register address, forward the second register write command to the slave unit (121) of the second core (120).
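The reserved-address check of claim 7 can be sketched as a simple filter in the receiving interface unit. The address value and all names below are illustrative assumptions; only writes hitting the reserved address are treated as task assignments for the slave unit:

```python
RESERVED_TASK_ADDR = 0x1F00  # hypothetical reserved register address

# Sketch of a claim-7 interface unit: a register write arriving from the bus
# is forwarded to the core's slave unit only if it targets the reserved
# address; any other write is left for ordinary register handling.
class InterfaceUnit:
    def __init__(self, reserved_addr, slave_inbox):
        self.reserved_addr = reserved_addr
        self.slave_inbox = slave_inbox

    def on_bus_write(self, register_addr, payload):
        if register_addr == self.reserved_addr:
            self.slave_inbox.append(payload)  # deliver task indication to the slave
            return True                       # consumed as a task assignment
        return False                          # not a task write
```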
8. A method (200) of transmitting image rendering tasks in a graphics processing unit (100) comprising a plurality of cores, the method comprising:
receiving (210), by a master unit (140) in one of the plurality of cores, a set of image rendering tasks;
assigning (220), by the master unit (140), a first subset of the image rendering tasks to a first core (110) of the plurality of cores;
assigning (230), by the master unit (140), a second subset of the image rendering tasks to a second core (120) of the plurality of cores;
transmitting (240), by the master unit (140), the first subset of image rendering tasks to a slave unit (111) of the first core (110);
transmitting (250), by the master unit (140), the second subset of image rendering tasks to a slave unit (121) of the second core (120);
storing (400), by the master unit (140), a credit number for each of the first and second cores (110, 120);
adjusting (410), by the master unit (140), for each task of the first subset of image rendering tasks, the credit number of the first core (110) by a first amount;
adjusting (420), by the master unit (140), for each task of the second subset of image rendering tasks, the credit number of the second core (120) by the first amount;
transmitting (430), by the slave unit (111) of the first core (110), a first credit notification to the master unit (140) when a task of the first subset of image rendering tasks has been processed;
transmitting (440), by the slave unit (121) of the second core (120), a second credit notification to the master unit (140) when a task of the second subset of image rendering tasks has been processed;
adjusting (450), by the master unit (140), the credit number of the first core (110) by a second amount when the master unit (140) receives the first credit notification; and
adjusting (460), by the master unit (140), the credit number of the second core (120) by the second amount when the master unit (140) receives the second credit notification, wherein one of the first amount and the second amount is positive and the other is negative.
9. The method of claim 8, further comprising:
assigning (500), by the master unit (140), based on the credit numbers of the cores, a subsequent image rendering task to the slave unit of the core currently assigned the least work;
adjusting (510), by the master unit (140), the credit number of the core to which the subsequent image rendering task has been assigned by the first amount; and
transmitting (520), by the master unit (140), the subsequent image rendering task to the slave unit of the core to which it has been assigned.
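The credit bookkeeping of claims 8 and 9 can be sketched as follows. This is a minimal model assuming the first amount is +1 per assigned task and the second amount is -1 per credit notification (so a core's credit number tracks its outstanding work); class and method names are illustrative, not from the claims:

```python
# Sketch of the claim-8/9 master-unit credit scheme: credits go up on
# assignment and down on completion, and new work goes to the core with
# the fewest outstanding tasks.
class MasterUnit:
    def __init__(self, core_ids):
        self.credits = {cid: 0 for cid in core_ids}

    def assign(self, task, core_id, slaves):
        self.credits[core_id] += 1    # first amount: +1 per assigned task
        slaves[core_id].append(task)  # transmit to that core's slave unit

    def on_credit_notification(self, core_id):
        self.credits[core_id] -= 1    # second amount: -1 per processed task

    def assign_to_least_loaded(self, task, slaves):
        # Claim 9: pick the core currently assigned the least work.
        core_id = min(self.credits, key=self.credits.get)
        self.assign(task, core_id, slaves)
        return core_id
```

Because one amount is positive and the other negative, the two adjustments cancel once a task completes, leaving the credit number as a live count of in-flight tasks per core.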
10. The method of any of claims 8 to 9, wherein the method comprises:
including (600), by the master unit (140), a task completion update command after a first task in the first subset of image rendering tasks;
processing (610), by the first core (110), the first task;
processing (620), by the first core (110), the task completion update command;
transmitting (630), by the slave unit (111) of the first core (110), a task completion update to the master unit (140);
assigning (640), by the master unit (140), a dependent task of the first task to one of the slave units (111, 121, 131) of the first and second cores (110, 120); and
transmitting (650), by the master unit (140), the dependent task to the core to which it has been assigned.
11. The method of claim 10, further comprising:
including (800), by the master unit (140), a memory flush command in the first subset of image rendering tasks after the first task and, optionally, before the task completion update command;
processing (810), by the first core (110), the memory flush command; and
writing (820), by the slave unit (111) of the first core (110), all output data stored in the first core (110) to a shared memory (710).
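The ordering recited in claims 10 and 11 can be sketched as a slave unit draining its command stream in order: the flush pushes buffered output to shared memory before the completion update tells the master that dependent tasks may be released. The function and queue-item shapes below are illustrative assumptions:

```python
# Sketch of claim-10/11 slave-side processing: tasks, a memory flush command,
# and a task completion update command are consumed in stream order.
def slave_process(queue, local_buffer, shared_memory, notify_master):
    for kind, data in queue:
        if kind == "task":
            local_buffer.append(f"output-of-{data}")  # produce output locally
        elif kind == "flush":
            shared_memory.extend(local_buffer)        # write all buffered output
            local_buffer.clear()
        elif kind == "completion_update":
            notify_master(data)  # id of the task whose dependents may now run
```

Placing the flush before the completion update guarantees that, by the time the master assigns a dependent task to any core, the first task's output is already visible in shared memory.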
12. A method of manufacturing, using an integrated circuit manufacturing system, a graphics processing unit according to any of claims 1 to 7, the method comprising:
processing a computer readable description of the graphics processing unit using a layout processing system to generate a circuit layout description of an integrated circuit embodying the graphics processing unit; and
manufacturing the graphics processing unit according to the circuit layout description using an integrated circuit generation system.
13. A computer readable storage medium having stored thereon computer readable code which, when executed, causes the method of any of claims 8 to 11 to be performed.
14. A computer readable storage medium having stored thereon a computer readable description of a graphics processing unit according to any of claims 1 to 7 which, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to:
process the computer readable description of the graphics processing unit using a layout processing system to generate a circuit layout description of an integrated circuit embodying the graphics processing unit; and
manufacture the graphics processing unit according to the circuit layout description using an integrated circuit generation system.
15. An integrated circuit manufacturing system, comprising:
a non-transitory computer readable storage medium having stored thereon a computer readable description of a graphics processing unit according to any one of claims 1 to 7;
a layout processing system configured to process the computer readable description to generate a circuit layout description of an integrated circuit containing the graphics processing unit; and
an integrated circuit generation system configured to manufacture the graphics processing unit according to the circuit layout description.
Applications Claiming Priority (3)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2204508.2A (GB2617113B) | 2022-03-30 | 2022-03-30 | Multicore state caching in graphics processing |
| GB2204508.2 | 2022-03-30 | | |
| GB2204510.8 | 2022-03-30 | | |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116894756A | 2023-10-17 |

Family ID: 81449561

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310323495.0A (pending) | Multi-core master/slave communication | 2022-03-30 | 2023-03-29 |
Country Status (2)

| Country | Link |
|---|---|
| CN | CN116894756A |
| GB | GB2617113B |

Families Citing this family (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117853309A | 2023-12-28 | 2024-04-09 | Moore Threads Intelligent Technology (Beijing) Co., Ltd. | Geometric processing method, device, equipment and storage medium |
Family Cites Families (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8074224B1 | 2005-12-19 | 2011-12-06 | Nvidia Corporation | Managing state information for a multi-threaded processor |
- 2022-03-30: GB application GB2204508.2A, patent GB2617113B (active)
- 2023-03-29: CN application CN202310323495.0A, publication CN116894756A (pending)
Also Published As

| Publication number | Publication date |
|---|---|
| GB2617113B | 2024-06-26 |
| GB2617113A | 2023-10-04 |
| GB202204508D0 | 2022-05-11 |
Similar Documents

| Publication | Title |
|---|---|
| US20230315645A1 | Virtual memory management |
| CN107851004B | Method and apparatus for executing instructions on a graphics processing unit (GPU) |
| US8963931B2 | Tiling compaction in multi-processor systems |
| US20230334748A1 | Control stream stitching for multicore 3-D graphics rendering |
| US20210248006A1 | Hardware resource allocation system for allocating resources to threads |
| US12135886B2 | Methods and allocators for allocating portions of a storage unit using virtual partitioning |
| US11609879B2 | Techniques for configuring parallel processors for different application domains |
| US11934878B2 | Memory allocation for 3-D graphics rendering |
| CN116894756A | Multi-core master/slave communication |
| GB2617114A | Multicore master/slave communications |
| WO2024129514A1 | Fused data generation and associated communication |
| EP4254311A1 | Multicore master/slave communications |
| US20050097231A1 | System and method for a configurable interface controller |
| CN111274161A | Location-aware memory with variable latency for accelerated serialization algorithms |
| US20250086103A1 | Processing memory access transactions |
| CN116894754B | Method, graphics processing unit, system, medium and manufacturing system for distributing tasks |
| KR102787376B1 | Electronic device for partitioning an accelerator, electronic device for scheduling a batch, and operating methods thereof |
| US12265484B2 | Processing device and method of sharing storage between cache memory, local data storage and register files |
| GB2632330A | Processing memory access transactions |
| GB2599041A | Methods for managing virtual partitionings of a storage unit |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |