CN118247127A

CN118247127A - Applying texture processing to fragment blocks in a graphics processing unit

Info

Publication number: CN118247127A
Application number: CN202311768584.2A
Authority: CN
Inventors: A·德梅尔; W·托马斯; A·霍夫曼; A·巴拉贝斯
Original assignee: Imagination Technologies Ltd
Current assignee: Imagination Technologies Ltd
Priority date: 2022-12-23
Filing date: 2023-12-19
Publication date: 2024-06-25
Also published as: GB2625800B; GB2625800A8; GB202219717D0; GB2625800A

Abstract

The present invention relates to applying texture processing to blocks of fragments in a graphics processing unit. A method and a Graphics Processing Unit (GPU) are provided for applying texture filtering to a block of fragments, each of the fragments being associated with a texture coordinate for each of a plurality of dimensions of a texture. The texture coordinates for the fragments of the block are detected to be axis aligned. Two or more integer texel coordinates are determined for each texture coordinate in a set of texture coordinates for the fragments of the block. A uniqueness process is performed on the determined integer texel coordinates to remove one or more duplicate integer texel coordinates and thereby determine a subset of the determined integer texel coordinates. Texel addresses of the texels to be extracted are generated using the subset of the determined integer texel coordinates. The texels are extracted using the generated texel addresses. For each of the fragments of the block, a filtered value is determined by applying filtering to a subset of the extracted texels. The filtered values are output.

Description

Applying texture processing to fragment blocks in a graphics processing unit

Technical Field

The present disclosure relates to techniques for applying texture processing (e.g., texture filtering) to blocks of fragments in a Graphics Processing Unit (GPU).

Background

In computer graphics, texturing is often used to add surface detail to objects to be rendered within a scene, or to apply post-processing to existing images. Textures are typically stored as images that are accessed to return color values for the fragments being processed. In computer graphics, a 2D rendering space is used to render a scene that includes primitives that represent objects in the scene. The 2D rendering space includes an array of sample locations, and a "fragment" refers to a discrete point on a primitive at a sample location. There may or may not be a 1:1 relationship between the sample locations and the pixel locations of the rendered image, e.g., if scaling or antialiasing is being implemented, the relationship may not be 1:1. To obtain the texture color value for a fragment, the values of a plurality of texels of the texture may be sampled, and the sampled texel values may then be filtered to obtain the final texture value for the fragment. A Graphics Processing Unit (GPU) may include a Texture Processing Unit (TPU) that is typically used to extract and filter texels of a texture to provide texture values to a fragment processing unit, for example, for: (i) applying a visual effect (e.g., color) to the surface of a geometric model during 3D rendering (which may involve tri-linear and/or anisotropic filtering), and (ii) post-processing to apply a visual effect to an existing image. The present disclosure relates generally to cases where the TPU is used for post-processing and for applying texture filtering during rendering, e.g., for rendering 2D images or for rendering Graphical User Interfaces (GUIs). 
The term "post-processing" as used herein refers to applying a process to pixel values of an existing image (e.g., an image that has been rendered by a GPU). In these cases, the pixel values of the existing image may be read back into the GPU as texels of a texture before being processed and applied to the fragments of the new post-processed image. Examples of post-processing procedures include tone mapping, applying depth of field effects, applying bloom to an image, magnification, and many different kinds of blurring (e.g., Gaussian blurring).

In general, a single fragment (e.g., corresponding to a single pixel of an image) to which texture processing is to be applied typically does not map exactly to a single texel of the texture, e.g., due to the projection of the texture onto 3D geometry within the image. There may be differences in alignment or scaling, which may be handled using interpolation/filtering or mipmapping, respectively. In some cases, anisotropic texture filtering may be performed. When anisotropic texture filtering is applied, the sampling kernel in texture space that is mapped to a fragment or pixel in screen space is elongated along a particular axis in texture space, where the direction of this axis depends on the mapping between screen space and texture space. This is shown schematically in FIG. 1, which shows an image 100 formed by pixels having coordinates defined in image space (according to screen space axes 'X' and 'Y' as shown in FIG. 1) and a texture 102 formed by texels having coordinates defined in texture space (according to texture space axes 'U' and 'V' as shown in FIG. 1). The image 100 includes an object 104 whose surface details are specified by the texture 102, i.e., the texture 102 maps to the surface of the object 104. The object 104 is at an oblique viewing angle within the image 100. As described above, if a texture is applied to geometry that is at an oblique angle with respect to the viewing direction, the isotropic coverage of a fragment or pixel in image space maps to an anisotropic coverage in texture space. Numeral 106 represents the coverage area of a fragment (corresponding to a pixel in image space), which is circular, and numeral 108 represents the corresponding fragment coverage area in texture space. It can be seen that the coverage area has been elongated in texture space (in a direction that is parallel to neither the U axis nor the V axis) to form an ellipse, such that the coverage area is anisotropic. 
In general, the mapping to texture space of a fragment having a circular coverage area in image space can be approximated by an ellipse, as long as the texture mapping itself can be approximated by an affine mapping at the pixel origin.

In the example shown in FIG. 1, the texture coordinates associated with the fragments are not axis aligned. In other words, when texture 102 is applied to the object 104 in image space, the U and V texture space axes are not aligned with the X and Y screen space axes. However, in other examples, the texture coordinates associated with the fragments are axis aligned, such that when the texture 102 is applied to the object 104 in image space, the U and V texture space axes are aligned with the X and Y screen space axes.

Texture processing units are typically configured to be able to apply different types of texture processing, rather than being dedicated to and optimized for performing only one of these types of texture processing. The different types of texture processing may include different types of texture filtering (e.g., point sampling, bilinear interpolation, anisotropic texture filtering, tri-linear filtering, etc.), different types of addressing modes (e.g., stride, rotation, etc.), different types of textures (e.g., 1D texture, 2D texture, 3D texture, and cube map), LOD computation, decompression of texture data, color space conversion, and/or gamma correction. Furthermore, texture processing is an expensive process to implement on the GPU (e.g., in terms of latency, power consumption, and/or silicon area). When designing GPUs, a trade-off is typically made between latency, power consumption, and silicon area, where it is generally desirable to reduce latency, reduce power consumption, and reduce the silicon area of the GPU. One of these three factors (latency, power consumption, silicon area) can typically be reduced by increasing one or both of the other two factors. It would be beneficial to reduce one of these factors without having to increase one of the other factors.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

There is provided a method of applying texture filtering to a block of fragments in a Graphics Processing Unit (GPU), each of the fragments being associated with a texture coordinate for each of a plurality of dimensions of a texture, the method comprising:

detecting that the texture coordinates for the fragments of the block are axis aligned;

determining two or more integer texel coordinates for each texture coordinate in a set of texture coordinates for the fragments of the block;

performing a uniqueness process on the determined integer texel coordinates to remove one or more duplicate integer texel coordinates and thereby determine a subset of the determined integer texel coordinates;

generating texel addresses of the texels to be extracted using the determined subset of integer texel coordinates;

extracting texels using the generated texel addresses;

for each of the fragments of the block, determining a filtered value by applying filtering to a subset of the extracted texels; and

outputting the filtered values.

The method may further include performing a de-uniqueness process on the extracted texels to determine which of the extracted texels are included in a subset for each of the fragments of the block.
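By way of illustration only, the de-uniqueness process for bilinear filtering can be sketched as a lookup from the shared block of extracted texels back to the four texels each fragment needs. The function name `fragment_subsets` and the dictionary-based representation of the extracted texels are assumptions made for this sketch and are not taken from the disclosure.

```python
def fragment_subsets(fetched, u_pairs, v_pairs):
    """Map each fragment to the four extracted texels it needs.

    fetched  -- dict mapping (u, v) integer texel coordinates to texel values
                (hypothetical representation of the extracted texels)
    u_pairs  -- one (lower, upper) integer coordinate pair per fragment column
    v_pairs  -- one (lower, upper) integer coordinate pair per fragment row
    """
    subsets = {}
    for row, (v_lo, v_hi) in enumerate(v_pairs):
        for col, (u_lo, u_hi) in enumerate(u_pairs):
            # Each texel was extracted once; neighbouring fragments share it.
            subsets[(row, col)] = [
                fetched[(u_lo, v_lo)], fetched[(u_hi, v_lo)],
                fetched[(u_lo, v_hi)], fetched[(u_hi, v_hi)],
            ]
    return subsets


# A 3x3 block of extracted texels serving a 2x2 block of fragments.
fetched = {(u, v): 10 * v + u for u in range(3) for v in range(3)}
subsets = fragment_subsets(fetched, [(0, 1), (1, 2)], [(0, 1), (1, 2)])
```

In this example the texel at integer coordinates (1, 1) appears in the subset of all four fragments although it is stored only once in `fetched`.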

The texture filtering may be bilinear filtering, wherein said determining two or more integer texel coordinates for each texture coordinate in the set of texture coordinates may include determining exactly two integer texel coordinates for each texture coordinate in the set, and wherein, for each of the fragments of the block, the subset may comprise four of the extracted texels, and the filtered value for the fragment may be determined by determining the result of a bilinear interpolation of those four extracted texels.

For each of the segments of the block, four pairs of integer texel coordinates may correspond to four texel addresses of the four extracted texels of the subset.

The two integer texel coordinates determined for each of the texture coordinates may be: (i) a first integer texel coordinate corresponding to the texture coordinate rounded down to an integer texel position; and (ii) a second integer texel coordinate which is one greater than the first integer texel coordinate.
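As a minimal sketch of this step, assuming the texture coordinate has already been converted to a (non-integer) texel position, the two integer texel coordinates can be obtained with a floor operation. The function name `integer_texel_coords` is hypothetical.

```python
import math


def integer_texel_coords(texel_pos):
    """Return the two integer texel coordinates for one texel position."""
    lower = math.floor(texel_pos)  # texel position rounded down
    return lower, lower + 1        # second coordinate is one greater


pair = integer_texel_coords(2.25)
```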

The level of detail of texture filtering may correspond to a 1:1 mapping between the pitch of the segments in the block and the pitch of the texels in the texture.

The block of fragments may be an mxn block of fragments, wherein the texture may be a 2D texture such that each fragment is associated with a texture coordinate for a horizontal dimension and a texture coordinate for a vertical dimension, and wherein the determined subset of integer texel coordinates may include n+1 integer texel coordinates for the horizontal dimension and m+1 integer texel coordinates for the vertical dimension.

The extracted texels may represent a (m+1) x (n+1) texel block of the texture.
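The following sketch illustrates, under stated assumptions, how removing duplicates reduces the coordinates for a 4x4 block of fragments to a 5x5 block of texels. It assumes a 1:1 pitch between fragments and texels and uses made-up texel positions; the function name `unique_integer_coords` is hypothetical and the list-based de-duplication is only one possible way to express the uniqueness process.

```python
import math


def unique_integer_coords(texel_positions):
    """Collect the floor and floor+1 coordinates, dropping duplicates."""
    unique = []
    for pos in texel_positions:
        lower = math.floor(pos)
        for coord in (lower, lower + 1):
            if coord not in unique:  # skip coordinates already recorded
                unique.append(coord)
    return unique


# One texel position per fragment column and one per fragment row.
u_unique = unique_integer_coords([10.25, 11.25, 12.25, 13.25])
v_unique = unique_integer_coords([3.5, 4.5, 5.5, 6.5])

# Every combination of the unique coordinates identifies one texel to extract.
texel_block = [(u, v) for v in v_unique for u in u_unique]
```

With these inputs each dimension yields five unique integer coordinates, so `texel_block` holds 25 coordinate pairs rather than the 64 that would arise from four pairs for each of the 16 fragments.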

For example, n=m=4, and the graphics processing unit may include 32 address generators, where each address generator may generate a texel address of a texel to be extracted within each clock cycle, and where 25 of the address generators may be used to generate a texel address of a texel in a 5x5 texel block within one clock cycle.

In other examples, m=4 and n=2, or m=2 and n=4, and the graphics processing unit may include 16 address generators, wherein each address generator may generate a texel address of a texel to be extracted within each clock cycle, and wherein 15 of the address generators may be used to generate a texel address of a texel in a 5x3 or 3x5 texel block within one clock cycle.
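The arithmetic behind these examples can be checked with a short sketch. It assumes bilinear filtering (four texel addresses per fragment when no duplicates are removed) and one address per address generator per clock cycle; the function names are hypothetical.

```python
import math


def address_counts(m, n):
    """Return (addresses without uniqueness, addresses with uniqueness)."""
    without_uniqueness = 4 * m * n          # four texels per fragment
    with_uniqueness = (m + 1) * (n + 1)     # one shared block of texels
    return without_uniqueness, with_uniqueness


def cycles_needed(num_addresses, num_generators):
    """Clock cycles needed if each generator emits one address per cycle."""
    return math.ceil(num_addresses / num_generators)


without_4x4, with_4x4 = address_counts(4, 4)
without_2x4, with_2x4 = address_counts(2, 4)
```

For the 4x4 case, 64 addresses would occupy 32 address generators for two cycles, whereas 25 addresses fit in one cycle; for the 2x4 case, 32 addresses would occupy 16 generators for two cycles, whereas 15 fit in one.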

The method may further include determining a fractional portion of the texel position corresponding to each of the texture coordinates of the set, wherein the horizontal interpolation weight for the bilinear interpolation of the segments may be based on the determined fractional portion of the texel position corresponding to the texture coordinate associated with the segment for the horizontal dimension, and wherein the vertical interpolation weight for the bilinear interpolation of the segments may be based on the determined fractional portion of the texel position corresponding to the texture coordinate associated with the segment for the vertical dimension.
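A minimal sketch of bilinear interpolation using the fractional portions as weights is given below. It assumes the texture is held as a list of rows indexed as `texture[v][u]` and that `u` and `v` are already texel positions; these conventions and the function name `bilinear_filter` are assumptions for illustration.

```python
import math


def bilinear_filter(texture, u, v):
    """Bilinearly interpolate texture[v][u] at a non-integer texel position."""
    u0, v0 = math.floor(u), math.floor(v)
    fu, fv = u - u0, v - v0  # fractional portions give the weights

    # Interpolate horizontally along the two rows, then vertically.
    top = texture[v0][u0] * (1 - fu) + texture[v0][u0 + 1] * fu
    bottom = texture[v0 + 1][u0] * (1 - fu) + texture[v0 + 1][u0 + 1] * fu
    return top * (1 - fv) + bottom * fv


texture = [[0.0, 1.0],
           [2.0, 3.0]]
centre = bilinear_filter(texture, 0.5, 0.5)
```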

The method may further comprise: before generating the texel addresses, detecting that the determined fractional portion of the texel position corresponding to a texture coordinate is zero, and in response to this detection, determining that two of the four texels for the bilinear interpolation of the fragments associated with that texture coordinate are not required for determining the result of the bilinear interpolation.
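The effect of a zero fractional portion can be sketched as follows. When the fractional portion in one dimension is zero, the upper coordinate in that dimension has zero weight and its two texels need not be extracted. The function name `required_texels` is hypothetical, and the case where both fractional portions are zero (leaving a single texel) is an extension assumed for this sketch.

```python
import math


def required_texels(u, v):
    """List the (u, v) integer texel coordinates with non-zero weight."""
    u0, v0 = math.floor(u), math.floor(v)
    # A zero fractional portion means the upper coordinate has zero weight.
    us = [u0] if u == u0 else [u0, u0 + 1]
    vs = [v0] if v == v0 else [v0, v0 + 1]
    return [(uc, vc) for vc in vs for uc in us]


needed = required_texels(2.0, 3.5)
```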

In response to determining that there is a sufficient number of repetitions of the determined integer texel coordinates, a uniqueness process may be performed on the determined integer texel coordinates.

In some examples, m=n=4, and said determining that there is a sufficient number of repetitions of the determined integer texel coordinates may include determining whether all six of the following expressions are satisfied:

$(u_{0+} = u_{1-}) \lor (u_{0+} = u_{2-}) \lor (u_{0+} = u_{3-}) \lor (u_{0+} = u_{3+})$

$(u_{1+} = u_{0-}) \lor (u_{1+} = u_{2-}) \lor (u_{1+} = u_{3-}) \lor (u_{1+} = u_{3+})$

$(u_{2+} = u_{0-}) \lor (u_{2+} = u_{1-}) \lor (u_{2+} = u_{3-}) \lor (u_{2+} = u_{3+})$

$(v_{0+} = v_{1-}) \lor (v_{0+} = v_{2-}) \lor (v_{0+} = v_{3-}) \lor (v_{0+} = v_{3+})$

$(v_{1+} = v_{0-}) \lor (v_{1+} = v_{2-}) \lor (v_{1+} = v_{3-}) \lor (v_{1+} = v_{3+})$

$(v_{2+} = v_{0-}) \lor (v_{2+} = v_{1-}) \lor (v_{2+} = v_{3-}) \lor (v_{2+} = v_{3+})$

where $u_{i-}$ and $u_{i+}$ are the two integer texel coordinates in the horizontal dimension for each of the fragments in the i-th column of the block of fragments, where $i \in \{0,1,2,3\}$;

where $v_{j-}$ and $v_{j+}$ are the two integer texel coordinates in the vertical dimension for each of the fragments in the j-th row of the block of fragments, where $j \in \{0,1,2,3\}$; and

where $\lor$ represents a logical OR operation.
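The six expressions can be evaluated with a short sketch such as the one below, which applies the same three-expression test to each dimension. The function names and the list-based inputs (`minus[i]` and `plus[i]` holding the lower and upper integer texel coordinate for column or row i) are assumptions for illustration.

```python
def dimension_has_repeats(minus, plus):
    """Evaluate the three expressions for one dimension (indices 0, 1, 2)."""
    for i in range(3):
        # Upper coordinate i is compared with every other lower coordinate
        # and with the upper coordinate of index 3.
        candidates = [minus[k] for k in range(4) if k != i] + [plus[3]]
        if plus[i] not in candidates:
            return False
    return True


def sufficient_repeats(u_minus, u_plus, v_minus, v_plus):
    """All six expressions hold only if both dimensions pass."""
    return (dimension_has_repeats(u_minus, u_plus)
            and dimension_has_repeats(v_minus, v_plus))


# 1:1 pitch: neighbouring columns and rows share integer coordinates.
dense = sufficient_repeats([10, 11, 12, 13], [11, 12, 13, 14],
                           [3, 4, 5, 6], [4, 5, 6, 7])
# Two-texel pitch horizontally: no coordinates are shared between columns.
sparse = sufficient_repeats([10, 12, 14, 16], [11, 13, 15, 17],
                            [3, 4, 5, 6], [4, 5, 6, 7])
```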

The integer texel coordinates of the 5x5 texel block may be:

The texture filtering may be two-dimensional polynomial filtering using a polynomial having a degree d, where d>1, wherein said determining two or more integer texel coordinates for each texture coordinate in the set of texture coordinates may comprise determining (d+1) integer texel coordinates for each texture coordinate in the set, and wherein, for each of the fragments of the block, the subset may comprise (d+1)² of the extracted texels, and the filtered value for the fragment may be determined by determining the result of a two-dimensional polynomial interpolation of those (d+1)² extracted texels, the polynomial interpolation using a polynomial with degree d.

One or both of the horizontal and vertical dimensions of the texture may be flipped relative to the dimension of the segment block.

A pair of integer texel coordinates for a first fragment of the block may be the same as a pair of integer texel coordinates for a second fragment of the block, in which case, due to the uniqueness process, the texel address corresponding to that pair of integer texel coordinates may be generated only once for processing the block of fragments.

The uniqueness process may make all of the texel addresses generated for processing the segment blocks unique.

The set of texture coordinates may be a reduced set of texture coordinates comprising:

for each column of fragments in the block, only one texture coordinate for the horizontal dimension; and

for each row of fragments in the block, only one texture coordinate for the vertical dimension.

Said detecting that the texture coordinates for the fragments of the block are axis aligned may include, for each of the dimensions of the texture:

for each line of fragments perpendicular to the dimension within the block of fragments, determining that the texture coordinate for the dimension is the same for all of the fragments within the line.
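The detection and the resulting reduced set of texture coordinates can be sketched as follows. The sketch assumes the per-fragment coordinates are available as 2D lists indexed `[row][column]` and uses exact equality; both the data layout and the function names are assumptions for illustration.

```python
def is_axis_aligned(u, v):
    """Check whether per-fragment coordinates u[row][col], v[row][col]
    are axis aligned."""
    rows, cols = len(u), len(u[0])
    # Horizontal coordinate must be constant down each column of fragments.
    columns_ok = all(u[r][c] == u[0][c]
                     for c in range(cols) for r in range(rows))
    # Vertical coordinate must be constant along each row of fragments.
    rows_ok = all(v[r][c] == v[r][0]
                  for r in range(rows) for c in range(cols))
    return columns_ok and rows_ok


def reduced_coordinate_set(u, v):
    """One horizontal coordinate per column, one vertical per row."""
    return list(u[0]), [row[0] for row in v]


u = [[0.1, 0.2, 0.3, 0.4] for _ in range(4)]
v = [[value] * 4 for value in (0.5, 0.6, 0.7, 0.8)]
aligned = is_axis_aligned(u, v)
reduced_u, reduced_v = reduced_coordinate_set(u, v)
```

For a 4x4 block this reduces the 32 coordinates (16 fragments with two coordinates each) to 8.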

There is provided a graphics processing unit configured to apply texture filtering to a block of fragments, each of the fragments being associated with a texture coordinate for each of a plurality of dimensions of a texture, the graphics processing unit comprising a fragment processing unit and a texture processing unit,

wherein the fragment processing unit is configured to:

detect that the texture coordinates for the fragments of the block are axis aligned; and

send a set of texture coordinates to the texture processing unit; and

wherein the texture processing unit is configured to:

extract texels using generated texel addresses; and

output filtered values.

The texture processing unit may include:

A texture address generation module configured to: (i) determining two or more integer texel coordinates for each of the texture coordinates, (ii) performing a uniqueness process on the determined integer texel coordinates, and (iii) generating a texel address of the texel to be extracted using the determined subset of integer texel coordinates;

An address processing module configured to extract texels using the generated texel address; and

A texture filtering module configured to: (i) Determining a filtered value for each of the segments of the block by applying filtering to the subset of extracted texels, and (ii) outputting the filtered value.

A graphics processing unit may be provided that is configured to perform any of the methods described herein.

A method of applying texture processing to a block of fragments in a Graphics Processing Unit (GPU) may be provided, each of the fragments being associated with a texture coordinate for each of a plurality of dimensions of a texture, the method comprising:

a fragment processing unit of the GPU detecting that the texture coordinates for the fragments of the block are axis aligned;

in response to detecting that the texture coordinates for the fragments of the block are axis aligned, sending a reduced set of texture coordinates to a texture processing unit of the GPU; and

the texture processing unit:

processing the reduced set of texture coordinates to generate texel addresses of texels to be extracted;

extracting texels using the generated texel addresses;

determining a processed value for each of the fragments of the block based on the extracted texels; and

outputting the processed values.

A graphics processing unit may be provided that is configured to apply texture processing to a block of fragments, each of the fragments being associated with a texture coordinate for each of a plurality of dimensions of a texture, the graphics processing unit comprising a fragment processing unit and a texture processing unit,

wherein the fragment processing unit is configured to:

detect whether the texture coordinates for the fragments of the block are axis aligned; and

in response to detecting that the texture coordinates for the fragments of the block are axis aligned, send a reduced set of texture coordinates to the texture processing unit; and

wherein the texture processing unit is configured to:

extract texels using generated texel addresses; and

output processed values.

A method of retrieving a block of data items in a processor may be provided, each of the data items being associated with coordinates for each of a plurality of dimensions of a stored data array, the method comprising:

A data processing unit of the processor detecting that the coordinates associated with the data item of the block are axis aligned;

In response to detecting that the coordinates of the data items for the block are axis aligned, sending to a data loading unit of the processor only one coordinate of the first dimension for each row of data items aligned in a first dimension within the block, and only one coordinate of the second dimension orthogonal to the first dimension for each row of data items aligned in a second dimension within the block; and

The data loading unit:

processing the coordinates to generate an address of a data array element to be extracted from the stored data array;

Extracting data array elements from the stored data array using the generated address;

determining a data item value for each of the data items of the block based on the extracted data array elements; and

Outputting the data item value.

A processor may be provided that is configured to retrieve a block of data items, each of the data items being associated with coordinates for each of a plurality of dimensions of a stored data array, the processor comprising a data processing unit and a data loading unit,

Wherein the data processing unit is configured to:

detecting whether the coordinates associated with the data item of the block are axis aligned; and

In response to detecting that the coordinates associated with the data items of the block are axis aligned, sending to the data loading unit only one coordinate of the first dimension for each row of data items aligned in a first dimension within the block, and only one coordinate of the second dimension orthogonal to the first dimension for each row of data items aligned in a second dimension within the block; and

Wherein the data loading unit is configured to:

Outputting the data item value.

Data loading unit of processor:

determining two or more integer coordinates for each coordinate in the set of coordinates;

Performing a uniqueness process on the determined integer coordinates to remove one or more duplicate integer coordinates and thereby determine a subset of the determined integer coordinates;

generating an address of a data array element to be extracted from the stored data array using the determined subset of integer coordinates;

for each of the data items of the block, determining a data item value using the extracted subset of data array elements; and

Outputting the data item value.

Wherein the data processing unit is configured to:

Detecting that coordinates associated with the data item of the block are axis aligned; and

Transmitting the coordinate set to a data loading unit; and

Wherein the data loading unit is configured to:

Outputting the data item value.

The graphics processing unit may be embodied in hardware on an integrated circuit. A method of manufacturing a graphics processing unit at an integrated circuit manufacturing system may be provided. An integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, configures the system to manufacture a graphics processing unit. A non-transitory computer readable storage medium may be provided having stored thereon a computer readable description of a graphics processing unit that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit that includes the graphics processing unit.

An integrated circuit manufacturing system may be provided, the integrated circuit manufacturing system comprising: a non-transitory computer readable storage medium having stored thereon a computer readable description of a graphics processing unit; a layout processing system configured to process the computer readable description to generate a circuit layout description of the integrated circuit including the graphics processing unit; and an integrated circuit generation system configured to fabricate the graphics processing unit in accordance with the circuit layout description.

Computer program code for performing any of the methods described herein may be provided. In other words, a computer readable code may be provided, which is configured to cause any of the methods described herein to be performed when the code is run. A non-transitory computer-readable storage medium having stored thereon computer-readable instructions may be provided that, when executed at a computer system, cause the computer system to perform any of the methods described herein.

As will be apparent to those skilled in the art, the above features may be suitably combined and combined with any of the aspects of the examples described herein.

Drawings

Examples will now be described in detail with reference to the accompanying drawings, in which:

FIG. 1 shows a schematic diagram of texture mapping between screen space and texture space;

FIG. 2 illustrates a graphics processing system including a graphics processing unit;

FIG. 3 is a flow chart for a method of applying texture processing to segment blocks in a graphics processing unit;

FIG. 4 shows a 4x4 fragment block;

FIG. 5 shows a reduced set of texture coordinates for a 4x4 segment block;

FIG. 6 illustrates a texture address generation module within a texture processing unit of a graphics processing unit;

FIG. 7 is a flowchart showing an example of how steps S310 and S312 of the flowchart shown in FIG. 3 can be performed when bilinear filtering is implemented;

FIG. 8 illustrates integer texel coordinates determined for a reduced set of texture coordinates in an example where a graphics processing unit applies bilinear filtering;

FIG. 9a shows an example of an 8x8 block of multiple pairs of integer texel coordinates, which would result from applying bilinear filtering to a 4x4 fragment block if the uniqueness process was not performed;

FIG. 9b shows a 5x5 block of pairs of integer texel coordinates, which 5x5 block would result from applying bilinear filtering to a 4x4 fragment block when performing the uniqueness process;

FIG. 10a shows a 2x4 segment block;

FIG. 10b shows an example of a 4x8 block of multiple pairs of integer texel coordinates, which would result from applying bilinear filtering to a 2x4 fragment block if the uniqueness process was not performed;

FIG. 10c shows a 3x5 block of pairs of integer texel coordinates, which 3x5 block would result from applying bilinear filtering to a 2x4 fragment block when performing the uniqueness process;

FIG. 11 illustrates a computer system in which a graphics processing unit is implemented; and

FIG. 12 illustrates an integrated circuit manufacturing system for generating an integrated circuit embodying a graphics processing unit.

The figures illustrate various examples. Skilled artisans will appreciate that element boundaries (e.g., blocks, groups of blocks, or other shapes) illustrated in the figures represent one example of boundaries. In some examples, it may be the case that one element may be designed as a plurality of elements, or that a plurality of elements may be designed as one element. Where appropriate, common reference numerals have been used throughout the various figures to indicate like features.

Detailed Description

The following description is presented by way of example to enable a person skilled in the art to make and use the invention. The invention is not limited to the embodiments described herein and various modifications to the disclosed embodiments will be apparent to those skilled in the art.

Embodiments will now be described by way of example only. FIG. 2 illustrates a graphics processing system 200 including a Graphics Processing Unit (GPU) 202 and a memory 204. GPU 202 includes a fragment processing unit 206, a Texture Processing Unit (TPU) 208, and one or more caches 210. The fragment processing unit 206, which may comprise a fragment shading unit or "unified shading cluster" (USC), is configured to process fragments, e.g., to render pixels of an image or to apply some post-processing to pixels of an existing image. For example, the image may be a high definition image comprising a block of 1920x1080 pixels. The texture processing unit 208 includes a texture address generation module 212 (which may be referred to as "TAG"), an address processing module 214 (which may be referred to as "MADD", which stands for "multiplexed address de-multiplexing decompressor"), and a texture filtering module 216 (which may be referred to as "TF"). GPU 202 includes an interface 218 for transferring data between fragment processing unit 206 and texture processing unit 208.

Examples described in detail herein relate to processing fragments. It should be understood that the 'fragments' described herein may generally be considered as data elements that the TPU can handle. The data elements may be image data elements, such as primitive fragments or pixels. The data elements may be, for example, non-image data elements for processing computational workloads, wherein the fragment processing unit may be considered to include a computational shader.

Fragment processing unit 206 and texture processing unit 208 may be implemented in hardware (e.g., fixed function circuitry), software, firmware, or a combination thereof. In general, software implementations (e.g., where software instructions are executed on a processing unit to implement functions) are more flexible than hardware implementations, but hardware implementations tend to be optimized to a greater extent (e.g., in terms of reducing latency and/or power consumption). Thus, for tasks such as texture processing, for which reduced latency and power consumption are important for the operation of the GPU and flexibility of operation is not so important, the functions are often implemented in hardware, e.g., in fixed function circuitry. When the modules (e.g., TAG 212, MADD 214, and TF 216) are implemented in hardware, it is beneficial to keep the silicon area of these modules low so that the integrated circuit implementing GPU 202 can be kept as small as possible. Reducing latency, power consumption, and silicon area of GPUs is particularly important when they are implemented in mobile devices (e.g., smartphones or tablet computers) where battery life, physical size, and performance are particularly critical.

When the fragment processing unit 206 processes a fragment (e.g., a pixel of an image), the fragment processing unit may determine that an image map (i.e., a "texture") is to be applied. The texture is represented as an array of texels (similar to an image being represented as an array of pixels). Each of the processed fragments is associated with a texture coordinate for each of the dimensions of the texture. For example, where the texture is a 2D texture, each of the fragments is associated with a U value (representing a texture coordinate for a first dimension) and a V value (representing a texture coordinate for a second dimension orthogonal to the first dimension). The first dimension may be referred to herein as the "horizontal" dimension, and the second dimension may be referred to herein as the "vertical" dimension. The texture coordinates input into the TPU 208 may be any integer or floating point values and may be normalized or non-normalized, but the TPU may apply some processing (e.g., clamping, wrapping, etc.) to ensure that the texture coordinates are within the proper range. In the examples described herein, the texture coordinates (e.g., U and V) are in a floating point format, such as a single precision floating point format, where each value is represented by 32 bits, and TPU 208 ensures that each of the texture coordinates is in the range of 0.0 to 1.0. Texture coordinates for a fragment to which texture processing is to be applied are sent from the fragment processing unit 206 to the texture processing unit 208. TAG 212 of TPU 208 receives the texture coordinates (e.g., U and V) and determines which texels of the texture should be fetched and the addresses of those texels. The TAG does this by converting the floating point texture coordinates (e.g., U and V) to texel coordinates (e.g., u and v). The texel coordinates may be in a fixed point format. 
In the case where the texture is a T_v×T_u block of texels, the texel coordinates for the horizontal dimension may be in the range of 0 to T_u-1, and the texel coordinates for the vertical dimension may be in the range of 0 to T_v-1. For example, if the texture is a 1920x1080 block of texels, then: (i) T_u is 1080 and the texel coordinates (u) for the horizontal dimension may be in the range of 0 to 1079, and (ii) T_v is 1920 and the texel coordinates (v) for the vertical dimension may be in the range of 0 to 1919. TAG 212 then rounds the texel coordinates to integers and uses the integer texel coordinates (e.g., u and v) to generate the texel addresses of the texels to be extracted. A texel address is a memory address, i.e., an address in the memory 204 at which the texel is stored. TAG 212 passes the texel addresses to MADD 214. TAG 212 also sends some sideband information to TF 216 to indicate how the extracted texels should be processed (e.g., filtered).
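The conversion from a normalized texture coordinate to a texel coordinate, and from integer texel coordinates to a memory address, can be sketched as follows. The disclosure does not specify the exact scaling rule or the memory layout, so the scale-and-clamp conversion, the row-major layout, the 4-byte texel size, and the function names below are all assumptions made purely for illustration.

```python
def to_texel_coord(normalized, size):
    """Scale a coordinate in [0.0, 1.0] to the range 0 .. size-1
    (assumed scale-and-clamp rule)."""
    scaled = normalized * size
    return min(max(scaled, 0.0), float(size - 1))


def texel_address(base, u_int, v_int, width, bytes_per_texel=4):
    """Compute a byte address assuming a row-major texture layout."""
    return base + (v_int * width + u_int) * bytes_per_texel


T_u = 1080  # horizontal texture size used in the example above
u = to_texel_coord(0.5, T_u)
address = texel_address(0x1000, 3, 2, T_u)
```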

MADD 214 accepts texture requests from TAG 212. MADD 214 may include an L0 cache; if the requested texels (i.e. the texels with the generated texel addresses) are in the L0 cache, the MADD decompresses them (if necessary) and passes them to TF 216. If a requested texel is not in the L0 cache in MADD 214, the MADD sends a request to the L1 cache (which is one of caches 210) to fetch the requested texel. If the data for the requested texel is in cache 210, the data is returned to MADD 214, but if the data for the requested texel is not in cache 210, the data is requested from memory 204 and returned to MADD 214. The MADD may decompress the texel data (if it is compressed) and then send the texel data to TF 216. The order in which MADD 214 sends the texels to TF 216 may be the same as the order in which the requests for the texels were received at MADD 214 from TAG 212.
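By way of illustration only, the lookup order described above (L0 cache, then L1 cache, then memory 204) can be modelled in a few lines of Python. The function name `fetch_texel` and the use of plain dictionaries for the caches and memory are hypothetical, and filling the caches on the return path is an assumption of this sketch rather than something stated in the description. Decompression is omitted.

```python
def fetch_texel(address, l0, l1, memory):
    """Return (texel_data, source) for a texel address.

    l0, l1 and memory are plain dicts standing in for the L0 cache in the
    MADD, the L1 cache and memory 204. Filling the caches on the way back
    is an assumption of this sketch, not something the text specifies.
    """
    if address in l0:
        return l0[address], "L0"
    if address in l1:
        data, source = l1[address], "L1"
    else:
        data, source = memory[address], "memory"
        l1[address] = data  # assumed fill of the L1 cache
    l0[address] = data      # assumed fill of the L0 cache
    return data, source
```

A first request for an address is served from memory, and a repeated request for the same address is then served from the L0 model.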

The TF 216 receives the texel data from the MADD 214 and the sideband information from the TAG 212 and processes the texel data according to the sideband information. For example, the TF may provide processing for implementing point sampling, bilinear filtering, polynomial filtering, trilinear filtering, anisotropic filtering, and the like. The processed values are output from TF 216 and provided back to the fragment processing unit 206. The fragment processing unit 206 may implement further processing of the values it receives from the TPU 208, for example, to determine a final processed image, which may then be used in any suitable way, for example, displayed on a display, stored in a memory, and/or transmitted to another device.

Texture processing is a particularly costly process implemented in GPU 202, and thus any reduction in latency, power consumption, and/or silicon area is particularly beneficial in the implementation of texture processing. Examples are described herein in which, in some common situations, the performance of texture processing may be improved (i.e., latency may be reduced) and/or the power consumption of GPU 202 may be reduced, with little or no detriment to any of the three factors: latency, power consumption, and silicon area. The examples described herein achieve these benefits when the texture coordinates associated with the fragments of a fragment block are axis aligned, and when the TPU 208 is to apply certain types of texture processing (e.g., point sampling or bilinear filtering). Axis-aligned texture coordinates are quite common, for example, when the TPU 208 performs post-processing, and when the TPU 208 is used to apply texturing to render a 2D game or a Graphical User Interface (GUI). Thus, in examples described herein, a system (e.g., fragment processing unit 206) may detect whether the texture coordinates for the fragments of a fragment block to be processed are axis aligned, and if the texture coordinates are axis aligned, the application of texture processing to the fragment block may be optimized to be more efficient (e.g., in terms of latency and/or power consumption). In the examples described herein, when TPU 208 performs point sampling or bilinear filtering on a fragment block having axis-aligned texture coordinates, the rate at which TPU 208 can process fragments may be doubled and power consumption may be reduced with minimal increase in the silicon area of TPU 208. 
In the examples described herein, one feature that helps achieve these benefits is to reduce the number of texture coordinates that need to be transferred from fragment processing unit 206 to texture processing unit 208, thereby reducing the amount of data transferred through interface 218 and reducing the amount of processing that needs to be performed by TAG 212 (e.g., reducing the number of floating point texture coordinates that are converted to fixed point integer texel coordinates). In the examples described herein where texture filtering (e.g., bilinear filtering or more generally polynomial filtering) is applied, another feature that helps achieve these benefits is that a uniqueness process may be performed on the integer texel coordinates prior to generating the texel address. The uniqueness process may reduce the number of generated texel addresses. Further details of how the examples described herein may achieve these benefits are explained below.

FIG. 3 is a flow chart for a method of applying texture processing to fragment blocks in a graphics processing unit 202. In step S302, the fragment processing unit 206 obtains a fragment block to which texture processing is to be applied. As described above, each of the fragments of the block is associated with texture coordinates (e.g., U-values and V-values in floating point format) for each of a plurality of dimensions of the texture. A fragment block may correspond to a block of pixels that is part of a larger image (e.g., an image having 1920x1080 pixels). The fragment processing unit 206 may obtain the fragment block in step S302 as part of the process of rendering the fragment block, for example in a case where a texture is to be applied to objects in the rendered scene. Alternatively, the fragment processing unit 206 may obtain a fragment block (which may represent a pixel block) in step S302 as part of a process of applying some post-processing to an existing image. The existing image may be a previously rendered image or an image that has been generated by some process other than rendering, for example, the existing image may be an image captured by a camera or generated in any other way. In the case where the fragment block corresponds to a block of pixels that is part of a larger image (e.g., a 1920x1080 image), the fragment processing unit 206 may be operative to apply some processing (e.g., post-processing) to each pixel of the image, but the fragment processing unit will not send the texture coordinates of all of the pixels of the larger image to the TPU 208 at once. Instead, the fragment processing unit sends texture coordinates for one block of pixels (e.g., a 4x4 block of pixels) at a time (e.g., within each clock cycle) to the TPU 208.

Fig. 4 shows a 4x4 fragment block 402 that may be obtained in step S302. As indicated in fig. 4, the fragments within the 4x4 block 402 are ordered from 0 to 15. In other examples, the fragments may be ordered in other ways. The ordering shown in fig. 4 is referred to as "Z-order" and may match the order in which the fragments are rendered by fragment processing unit 206. Z-order may be used to optimize cache locality, but a different order (such as N-order) is also suitable for optimizing cache locality, and more generally any ordering of fragments within a fragment block may be used, provided that the ordering is consistent and predetermined for each of the fragment blocks to be processed. Furthermore, while the Z-order shown in fig. 4 may match the rasterization pattern, the methods described herein may be used with compute kernels that do not use rasterization, for example by reordering the texture coordinates of threads to match the rasterization pattern. In one example, reordering the compute kernel work to match the rasterization pattern may be accomplished by adding instructions to the compute kernel. This may be done manually by a developer or by a driver/compiler. In another example, the GPU may include hardware that automatically detects that the compute work may be reordered to match the rasterization pattern and automatically applies the reordering. In another example, the GPU may include functionality for detecting an order and adapting the detection pattern to match this order.
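By way of illustration only, the Z-order numbering of the 4x4 block can be expressed as a bit interleave of the column and row of a fragment. The Python sketch below is an assumption that is consistent with the column and row groupings used for block 402 in this description (e.g., fragments 0, 2, 8 and 10 share a column, and fragments 0, 1, 4 and 5 share a row). The function names are hypothetical.

```python
def z_order_index(column, row):
    """Index of the fragment at (column, row) in a 4x4 block in Z-order.

    Column bits go to the even bit positions and row bits to the odd
    ones, which reproduces the numbering used for block 402.
    """
    index = 0
    for bit in range(2):  # 2 bits per axis cover a 4x4 block
        index |= ((column >> bit) & 1) << (2 * bit)
        index |= ((row >> bit) & 1) << (2 * bit + 1)
    return index


def z_order_position(index):
    """Inverse mapping: (column, row) of fragment `index`."""
    column = (index & 1) | ((index >> 1) & 2)
    row = ((index >> 1) & 1) | ((index >> 2) & 2)
    return column, row
```

Evaluating `z_order_index` over all sixteen positions gives the rows 0, 1, 4, 5 / 2, 3, 6, 7 / 8, 9, 12, 13 / 10, 11, 14, 15.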

In step S304, the fragment processing unit 206 detects whether the texture coordinates for the fragments of the block are axis aligned. Step S304 may involve detecting a pattern indicating that the texture coordinates are axis aligned. Step S304 may include detecting whether the U coordinates in each column are the same for all of the fragments in the column and whether the V coordinates in each row are the same for all of the fragments in the row. More generally, step S304 may include: for each of the dimensions of the texture, for each line of fragments within the fragment block that is perpendicular to the dimension, determining whether the texture coordinates for the dimension are the same for all of the fragments within the line. The level of precision at which this determination is made may vary in different examples. In a first example, the texture coordinates may be determined to be "identical" only if they are exactly identical, i.e., when all of the bits of the texture coordinates are identical. In a second example, when determining whether the texture coordinates are identical, one or more of the least significant bits of the mantissa of the texture coordinates may be ignored, such that approximately identical texture coordinates may be determined to be identical (even though they are not exactly identical). In some examples, determining whether texture coordinates may be considered identical may involve considering API accuracy requirements or other conditions.
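By way of illustration only, the two precision levels described above can be sketched in Python by comparing the 32-bit patterns of two single precision values, optionally masking off some least significant mantissa bits first. The function names and the `ignored_lsbs` parameter are hypothetical. Masking is only one possible way of "ignoring" low bits, and it will still report a mismatch for two close values that happen to straddle a mask boundary.

```python
import struct


def float32_bits(value):
    """Bit pattern of `value` when stored as a single precision float."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def coords_match(a, b, ignored_lsbs=0):
    """Compare two texture coordinates as 32-bit floats.

    With ignored_lsbs=0 every bit must match; otherwise that many least
    significant mantissa bits are masked off before comparing.
    """
    mask = 0xFFFFFFFF & ~((1 << ignored_lsbs) - 1)
    return (float32_bits(a) & mask) == (float32_bits(b) & mask)
```

With `ignored_lsbs=0` this corresponds to the first example (exact match of all bits), and with a small positive value it corresponds to the second example.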

Referring to the example of the 4x4 fragment block 402 shown in fig. 4, step S304 may involve detecting whether the U texture coordinates are the same for all of the fragments in each column by determining whether:

U[0] = U[2] = U[8] = U[10],

U[1] = U[3] = U[9] = U[11],

U[4] = U[6] = U[12] = U[14], and

U[5] = U[7] = U[13] = U[15],

where U[i] is the texture coordinate in the horizontal dimension for the i-th fragment of block 402. Further, in this example, step S304 may involve detecting whether the V texture coordinates are the same for all of the fragments in each row by determining whether:

V[0] = V[1] = V[4] = V[5],

V[2] = V[3] = V[6] = V[7],

V[8] = V[9] = V[12] = V[13], and

V[10] = V[11] = V[14] = V[15],

where V[j] is the texture coordinate in the vertical dimension for the j-th fragment of block 402.

If all of the equalities set out above hold, it is detected in step S304 that the texture coordinates for the fragments of the block are axis aligned; if one or more of them do not hold, it is detected in step S304 that the texture coordinates for the fragments of the block are not axis aligned.
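By way of illustration only, the check of step S304 for the 4x4 block can be sketched in Python as follows. The index groups are taken directly from the equalities above. The names `COLUMNS`, `ROWS` and `is_axis_aligned` are hypothetical, and ordinary Python equality is used here as a stand-in for the bit-exact comparison of the first example.

```python
# Fragment indices per column and per row of the 4x4 block (Z-order),
# copied from the equalities given for block 402.
COLUMNS = [(0, 2, 8, 10), (1, 3, 9, 11), (4, 6, 12, 14), (5, 7, 13, 15)]
ROWS = [(0, 1, 4, 5), (2, 3, 6, 7), (8, 9, 12, 13), (10, 11, 14, 15)]


def is_axis_aligned(U, V):
    """True if U is constant down each column and V along each row.

    U and V are sequences of 16 texture coordinates indexed by fragment.
    """
    columns_ok = all(len({U[i] for i in col}) == 1 for col in COLUMNS)
    rows_ok = all(len({V[j] for j in row}) == 1 for row in ROWS)
    return columns_ok and rows_ok
```

Changing a single U coordinate within an otherwise aligned block is enough for the check to fail.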

If the fragment processing unit 206 detects in step S304 that the texture coordinates for the fragments of the block are axis aligned, the method proceeds to step S306. If the fragment processing unit 206 detects in step S304 that the texture coordinates for the fragments of the block are not axis aligned, the method proceeds to step S320 (which is described below).

In step S306, in response to detecting that the texture coordinates for the fragments of block 402 are axis aligned, the fragment processing unit 206 sends a reduced set of texture coordinates to the texture processing unit 208. The reduced set of texture coordinates includes: (i) for each column of fragments in the block, only one texture coordinate for the horizontal dimension (i.e., only one U coordinate per column), and (ii) for each row of fragments in the block, only one texture coordinate for the vertical dimension (i.e., only one V coordinate per row).

Fig. 5 shows a reduced set of texture coordinates 502 for the 4x4 fragment block 402. In particular, in this case, the texture coordinates have been found to be axis aligned, and the U coordinates (U₀, U₁, U₂ and U₃) for the four columns of fragments of block 402 are given by:

U₀ = U[0] = U[2] = U[8] = U[10],

U₁ = U[1] = U[3] = U[9] = U[11],

U₂ = U[4] = U[6] = U[12] = U[14], and

U₃ = U[5] = U[7] = U[13] = U[15].

Similarly, since in this case the texture coordinates have been found to be axis aligned, the V coordinates (V₀, V₁, V₂ and V₃) for the four rows of fragments of block 402 are given by:

V₀ = V[0] = V[1] = V[4] = V[5],

V₁ = V[2] = V[3] = V[6] = V[7],

V₂ = V[8] = V[9] = V[12] = V[13], and

V₃ = V[10] = V[11] = V[14] = V[15].
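By way of illustration only, forming the reduced set for an axis-aligned 4x4 block amounts to picking one representative fragment per column and per row. The Python sketch below picks the first index of each group listed above. The function name is hypothetical, and the choice of representative is an assumption (any fragment of the column or row would do once alignment has been detected).

```python
def reduce_coordinates(U, V):
    """Pick one U per column and one V per row of an axis-aligned 4x4 block.

    U and V hold 16 coordinates each, indexed by fragment in Z-order.
    Fragments 0, 1, 4, 5 lie in columns 0..3 and fragments 0, 2, 8, 10
    lie in rows 0..3, so they serve as representatives.
    """
    u_cols = [U[0], U[1], U[4], U[5]]
    v_rows = [V[0], V[2], V[8], V[10]]
    return u_cols, v_rows
```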

The complete texture coordinate set (i.e., the non-reduced texture coordinate set) for the 4x4 fragment block would include 16 U texture coordinates and 16 V texture coordinates (i.e., a U texture coordinate and a V texture coordinate for each of the fragments in block 402). In previous systems that do not use a reduced set of texture coordinates as in the examples described herein, 32 texture coordinates would be sent from the fragment processing unit 206 to the texture processing unit 208 for application of texture processing to the 4x4 fragment block 402. In the example shown in fig. 5, the reduced set of texture coordinates includes only four U texture coordinates and four V texture coordinates, such that in the examples described herein, for a 4x4 fragment block 402 with axis-aligned texture coordinates, only eight texture coordinates are sent from the fragment processing unit 206 to the texture processing unit 208 for application of texture processing to the 4x4 fragment block 402. In other words, in this case, only one quarter of the texture coordinates are sent from the fragment processing unit 206 to the texture processing unit 208. This means that less data is transferred through the interface 218 between the fragment processing unit 206 and the TPU 208, which results in reduced power consumption and may allow the interface 218 to be narrower, thereby reducing the silicon area of the GPU 202. Further, there may be a limit to the number of texture coordinates that may be sent from the fragment processing unit 206 to the TPU 208 over the interface 218 within each clock cycle. 
For example, the limit may be 16 such that if a full set of texture coordinates is used for a 4x4 fragment block, it would take 2 clock cycles to send the 32 texture coordinates for the block to the TPU 208, while if a reduced set of texture coordinates is used for a 4x4 fragment block, it would take 1 clock cycle to send the 8 texture coordinates for the block to the TPU 208, thereby facilitating a doubling of the rate at which texture processing can be applied. In other words, if a complete set of texture coordinates is sent to TPU 208, the TPU receives texture coordinates for only eight fragments per clock cycle, while if a reduced set of texture coordinates is sent to TPU 208, the TPU receives texture coordinates for sixteen fragments per clock cycle, thus doubling the rate and significantly reducing power consumption with minimal increase in silicon area.

In addition to sending the texture coordinates to TPU 208, in response to detecting that the texture coordinates for the fragments of the block are axis aligned, the fragment processing unit 206 sends an indication (e.g., a 1-bit indication) to TPU 208 indicating whether the texture coordinates are axis aligned.

Although fig. 4 shows a 4x4 fragment block, it should be noted that in other examples, fragment blocks may be of different sizes and/or shapes. In particular, the fragment block may be an m×n fragment block (i.e., m rows by n columns of fragments), wherein the texture is a 2D texture such that each fragment is associated with two texture coordinates (U and V). The reduced set of texture coordinates then includes n texture coordinates for the horizontal dimension and m texture coordinates for the vertical dimension. In some examples, e.g., as shown in fig. 4, m=n=4, while in other examples m and n may take different values, for example, in a first further example, m=4 and n=2, in a second further example, m=2 and n=4, in a third further example, m=n=2, in a fourth further example, m=n=8, in a fifth further example, m=8 and n=4, and in a sixth further example, m=4 and n=8.
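By way of illustration only, the coordinate counts and clock cycle figures discussed above can be reproduced with a short Python calculation. The function name `transfer_cost` and the parameter `coords_per_cycle` are hypothetical. The value 16 is the example limit given in the description, and m and n are taken as the numbers of rows and columns, so that the reduced set holds n U coordinates and m V coordinates.

```python
import math


def transfer_cost(m, n, axis_aligned, coords_per_cycle=16):
    """Coordinates sent for an m x n block and cycles needed to send them.

    A full set has one U and one V per fragment (2 * m * n values); the
    reduced set has one U per column and one V per row (m + n values).
    coords_per_cycle models the per-cycle limit on interface 218.
    """
    count = (m + n) if axis_aligned else (2 * m * n)
    return count, math.ceil(count / coords_per_cycle)
```

For a 4x4 block this gives 32 coordinates over 2 cycles for the full set and 8 coordinates in 1 cycle for the reduced set.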

In step S308, the texture processing unit 208 (in particular TAG 212) processes the reduced texture coordinate set to generate a texel address of the texel to be extracted. As an example, fig. 6 shows some details of the texture address generation module (TAG) 212 within the TPU 208. TAG 212 includes a TAG front end 602, texture coordinate to texel coordinate conversion logic 604 (which may be referred to herein as "texture to texel conversion logic 604"), uniqueness logic 606, and a plurality of address generators 608.

In the case where the texture coordinates for a fragment block are axis aligned, the reduced set of texture coordinates is received from the fragment processing unit 206 at the TAG front end 602 along with the indication that the texture coordinates are axis aligned. The TAG front end determines whether the texture state is compatible with the texture processing optimization for axis-aligned texture coordinates described herein. It should be noted that some of the checks on the state may be performed in the fragment processing unit 206 and some of the checks may be performed in the TAG front end 602. Many different fields of the texture state data may be examined in order to determine whether the texture state is compatible; to name a few examples, it may be checked that the texture is a 2D texture, that no anisotropic filtering is to be applied, and that mipmaps will not be used to apply the texture at a reduced level of detail (LOD).

If the TAG front end 602 determines that the texture state is compatible with texture processing optimization for the axis aligned texture coordinates, then the reduced set of texture coordinates is passed to the texture-to-texel conversion logic 604. However, if TAG front end 602 determines that the texture state is not compatible with texture processing optimization for the axis aligned texture coordinates, the reduced set of texture coordinates is decompressed to determine a complete set of texture coordinates (i.e., 32 texture coordinates) that is then passed to texture-to-texel conversion logic 604 for processing without further optimization as described herein for the reduced set of texture coordinates.
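By way of illustration only, the decompression of a reduced set back into a complete set of 32 texture coordinates for a 4x4 block can be sketched in Python with two lookup tables that give the column and row of each fragment in Z-order. The table and function names are hypothetical, and the tables are an assumption derived from the column and row groupings used for block 402.

```python
# Column and row of each fragment 0..15 of the 4x4 block in Z-order.
COLUMN_OF = [0, 1, 0, 1, 2, 3, 2, 3, 0, 1, 0, 1, 2, 3, 2, 3]
ROW_OF = [0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3]


def expand_reduced_set(u_cols, v_rows):
    """Rebuild per-fragment U and V lists from 4 column and 4 row values."""
    U = [u_cols[COLUMN_OF[i]] for i in range(16)]
    V = [v_rows[ROW_OF[i]] for i in range(16)]
    return U, V
```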

As shown in fig. 3, in the case of passing the reduced texture coordinate set to the texture-to-texel conversion logic 604, step S308 may include: (i) In step S310, the texture-to-texel conversion logic 604 determines a set of one or more integer texel coordinates for each texture coordinate in the reduced set of texture coordinates; and (ii) in step S312, TAG 212 (e.g., address generator 608) generates a texel address of the texel to be extracted using the determined integer texel coordinates.

In particular, in step S310, the texture-to-texel conversion logic 604 determines the type of texture processing to be applied, e.g., point sampling or texture filtering (such as bilinear filtering or another polynomial filtering). An indication of the determined texture processing type is sent as sideband data from the texture-to-texel conversion logic 604 to TF 216. Also in step S310, the texture-to-texel conversion logic 604 converts each texture coordinate in the reduced set of texture coordinates from a floating point format (e.g., a single precision floating point format, in which each texture coordinate uses 32 bits to represent a number between 0.0 and 1.0) to a fixed point format representing a texel coordinate. In the case where the texture is a T_u×T_v texel block, the texel coordinates for the horizontal dimension may be in the range of 0 to T_u - 1, and the texel coordinates for the vertical dimension may be in the range of 0 to T_v - 1. For example, if the texture is a 1920x1080 texel block, then: (i) T_u is 1920 and the texel coordinates (u) for the horizontal dimension may be in the range of 0 to 1919, and (ii) T_v is 1080 and the texel coordinates (v) for the vertical dimension may be in the range of 0 to 1079. Each of the texel coordinates is rounded to an integer texel coordinate. For example, the texel coordinates may be rounded down to integer texel coordinates. The fractional part of each texel coordinate (prior to rounding) may be passed to TF 216 for texture filtering, as described below. In other examples, another rounding mode may be used, e.g., the texel coordinates may be rounded up to integer texel coordinates, or rounded to the nearest integer texel coordinates (e.g., with ties rounded to even). 
If a round-up or round-to-nearest rounding mode is used, the fractional portion of the texel coordinates may be determined after rounding (e.g., by finding the difference between the unrounded texel coordinates and the rounded texel coordinates). Whether the fractional portion of the texel coordinates is passed from the texture-to-texel conversion logic 604 to the TF 216 may depend on the type of texture processing performed, e.g., if texture filtering (e.g., bilinear filtering) is applied, the fractional portion of the texel coordinates may be passed to the TF 216, but if point sampling is performed, the fractional portion may not be passed to the TF 216. It should be noted that converting floating point texture coordinates (e.g., U and V) to fixed point integer texel coordinates (e.g., u and v) is a relatively costly process in terms of power consumption.

As an example, for a 1920x1080 texture, if the floating point U and V texture coordinates are U=0.5 and V=0.5, respectively, then the texel coordinates determined by the texture-to-texel conversion logic 604 will be u=959.5 and v=539.5. These values may be rounded down to u=959 and v=539, and the fractional parts of the texel coordinates (u_frac=0.5 and v_frac=0.5) may be passed to TF 216 (e.g., when bilinear filtering is applied).
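By way of illustration only, the worked example above can be reproduced in Python. The mapping `coord * size - 0.5` (i.e., texel centres lying at half-integer positions) is an assumption of this sketch, chosen because it matches the numbers in the example; the description does not spell out the formula. The function name is hypothetical, floating point arithmetic stands in for the fixed point format, and clamping or wrapping of out-of-range results is omitted.

```python
import math


def texture_to_texel(coord, size):
    """Convert a normalised texture coordinate to a texel coordinate.

    Returns (integer_part, fractional_part) using round-down. The
    `coord * size - 0.5` mapping is an assumption chosen to match the
    worked example in the text.
    """
    texel = coord * size - 0.5
    integer = math.floor(texel)
    return integer, texel - integer
```

For U=0.5 with a horizontal size of 1920 this yields the integer texel coordinate 959 and the fraction 0.5, and for V=0.5 with a vertical size of 1080 it yields 539 and 0.5.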

If the texture processing performed is point sampling, a single integer texel coordinate (e.g. u or v) is determined for each texture coordinate (e.g. U or V) of the reduced set of texture coordinates. In this case, the uniqueness logic 606 may not be used and the eight integer texel coordinates for the 4x4 fragment block are passed from the texture-to-texel conversion logic 604 to the address generators 608. Within each clock cycle, each of the address generators 608 may generate a texel address for a texel to be extracted based on a pair of integer texel coordinates, taking into account, for example, the texture format and whether the texture is strided or rotated, as well as other factors known to those skilled in the art. In one example, TAG 212 includes 32 address generators 608, and in this example, when point sampling is performed, half of the address generators (i.e., 16 of the address generators) may be used within each clock cycle to generate the addresses of the texels to be extracted for applying point sampling to the fragments of the fragment block. It should be noted that in other examples, TAG 212 may include more or fewer than 32 address generators 608.

Each of the texel addresses corresponds to a pair of the determined integer texel coordinates, wherein each of the plurality of pairs of integer texel coordinates includes a u texel coordinate (i.e. texel coordinates for the horizontal dimension) and a v texel coordinate (i.e. texel coordinates for the vertical dimension). Up to this point in the texture processing pipeline (i.e., up to address generator 608), the horizontal and vertical coordinates have been processed independently, meaning that when the texture coordinates are axis aligned, the number of coordinates processed up to this point is reduced. At this point, however, the TPU 208 (particularly the address generator 608) does generate a texel address for each of the segments. When the system determines (in step S306) a reduced texture coordinate set, the system may be considered to compress the U-coordinates and the V-coordinates, and when the address generator 608 again pairs the texel coordinates, the system may be considered to decompress the texel coordinates in step S312. It should be noted that in some alternative examples, the address generator may be implemented later in the pipeline, e.g., in extreme examples, the L0 cache may be accessed based on integer texel coordinates (u and v), and the texel address may be generated only in response to a miss on the L0 cache.

For example, the pairs of texel coordinates for each of the fragments (P[0] to P[15]) shown in fig. 4 are:

P[0]: u₀, v₀

P[1]: u₁, v₀

P[2]: u₀, v₁

P[3]: u₁, v₁

P[4]: u₂, v₀

P[5]: u₃, v₀

P[6]: u₂, v₁

P[7]: u₃, v₁

P[8]: u₀, v₂

P[9]: u₁, v₂

P[10]: u₀, v₃

P[11]: u₁, v₃

P[12]: u₂, v₂

P[13]: u₃, v₂

P[14]: u₂, v₃

P[15]: u₃, v₃

where u_i is the integer texel coordinate determined from texture coordinate U_i (for i = 0, 1, 2, 3), and where v_j is the integer texel coordinate determined from texture coordinate V_j (for j = 0, 1, 2, 3).

The texel address generated by address generator 608 is a memory address indicating where the corresponding texel is stored in memory 204. The generated texel address is passed from TAG 212 to MADD 214.
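By way of illustration only, the pairing of the per-column and per-row integer texel coordinates listed above, followed by address generation, can be sketched in Python. The pairing follows the P[0] to P[15] list. The address formula assumes a plain row-major layout with a hypothetical base address, texture width and texel size; real texture layouts (e.g., strided or compressed formats) differ, so `texel_address` is an assumption for this sketch only.

```python
def pair_texel_coordinates(u, v):
    """Pair per-column u and per-row v values, one (u, v) per fragment.

    Fragments are indexed 0..15 in Z-order, matching P[0]..P[15].
    """
    pairs = []
    for index in range(16):
        column = (index & 1) | ((index >> 1) & 2)
        row = ((index >> 1) & 1) | ((index >> 2) & 2)
        pairs.append((u[column], v[row]))
    return pairs


def texel_address(u, v, base, width, bytes_per_texel=4):
    """Memory address of texel (u, v), assuming a row-major layout."""
    return base + (v * width + u) * bytes_per_texel
```

Calling `pair_texel_coordinates` with four u values and four v values returns sixteen pairs, each of which could then be passed to `texel_address`.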

In step S314, the address processing Module (MADD) 214 extracts texels using the generated texel address. The extracted texels are decompressed by MADD 214 (if the extracted texels are compressed) and then provided to TF 216. Texels may be fetched from a cache or memory 204. As described above, MADD 214 may itself include an L0 cache and decompress (if necessary) and pass the texels to TF 216 if the requested texels (i.e. the texels with the generated texel addresses) are in the L0 cache. If the requested texel is not in the L0 cache in MADD 214, the MADD sends the request to the L1 cache (which is one of caches 210) to fetch the requested texel. If the data for the requested texel is in cache 210, the data is returned from cache 210 to MADD 214, but if the data for the requested texel is not in cache 210, the data is requested from memory 204 and returned to MADD 214. The order in which MADD 214 sends the texels to TF 216 may be the same as the order in which the texels were received at MADD 214 from TAG 212.

The TF 216 receives the texel data from the MADD 214 and the sideband information from the TAG 212. In step S316, the TF 216 determines a processed value for each of the fragments of the block based on the extracted texels. In particular, TF 216 processes the texel data according to the sideband information. For example, in the case where the texture processing is point sampling, the processed value for each of the fragments may be the extracted texel for that fragment.

In step S318, the TF 216 outputs the processed values. Some further processing, such as color space conversion or gamma correction, may or may not be performed on the output processed values in the TPU 208, and the processed values are then provided to the fragment processing unit 206 via the interface 218. Since the rate of point sampling is doubled for axis-aligned texture coordinates, processed values for 16 fragments can be provided from the TPU 208 to the fragment processing unit 206 within each clock cycle (as compared to processed values for 8 fragments when the texture coordinates are not axis aligned), and the interface 218 is made wide enough to accommodate this.

The fragment processing unit 206 may perform further processing on the processed values it receives from the TPU 208 in order to determine a final processed image, which may then be used in any suitable way, for example, displayed on a display, stored in a memory and/or transmitted to another device.

As described above, in step S310, the texture-to-texel conversion logic 604 determines the type of texture processing to be applied. An example in which point sampling is applied is described above. An example in which bilinear filtering is applied will now be described with reference to figs. 7 to 10c. An indication of the determined texture processing type (i.e., bilinear filtering in this example) is sent as sideband data from the texture-to-texel conversion logic 604 to TF 216. When the TPU 208 applies bilinear filtering, the TF 216 (in step S316) determines, as the processed value for each of the fragments of the block 402, the result of a bilinear interpolation of four of the texels extracted in step S314.
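By way of illustration only, a bilinear interpolation of four texels can be sketched in Python as two horizontal blends followed by one vertical blend, weighted by the fractional parts of the texel coordinates. Scalar texel values are assumed here for brevity (a real texel typically has several channels), and the function and parameter names are hypothetical.

```python
def bilinear(t00, t10, t01, t11, u_frac, v_frac):
    """Blend four texels using the fractional texel coordinates.

    t00 is the texel at the lower u and lower v integer coordinates,
    t10 at (higher u, lower v), t01 at (lower u, higher v) and t11 at
    (higher u, higher v).
    """
    top = t00 * (1 - u_frac) + t10 * u_frac
    bottom = t01 * (1 - u_frac) + t11 * u_frac
    return top * (1 - v_frac) + bottom * v_frac
```

With both fractions equal to 0.5 the result is the plain average of the four texels, and with both fractions equal to 0 the result is simply `t00`.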

Fig. 7 is a flowchart showing an example of how steps S310 and S312 of the flowchart shown in fig. 3 can be performed when bilinear filtering is implemented. In this case, step S310 (of determining a set of one or more integer texel coordinates for each of the texture coordinates) is performed by executing step S702, in which step S702 the texture-to-texel conversion logic 604 determines two integer texel coordinates for each of the texture coordinates of the reduced set of texture coordinates.

As described above, when the texture-to-texel conversion logic 604 converts texture coordinates (e.g., U and V) from a floating point format to texel coordinates (e.g., u and v) in a fixed point format, the logic 604 rounds each of the texel coordinates to integer texel coordinates. In the case of bilinear filtering, for each texel coordinate (u_i or v_j) determined from a texture coordinate (U_i or V_j), where i ∈ {0,1,2,3} and j ∈ {0,1,2,3}, the texture-to-texel conversion logic 604 rounds the texel coordinate down to determine a first integer texel coordinate (u_{i-} or v_{j-}) and rounds the texel coordinate up to determine a second integer texel coordinate (u_{i+} or v_{j+}). In this case, u_{i+} = u_{i-} + 1 and v_{j+} = v_{j-} + 1. In other words, the two integer texel coordinates (e.g. u_{i-} and u_{i+}, or v_{j-} and v_{j+}) determined for each of the texture coordinates (e.g. U_i and V_j) are: (i) a first integer texel coordinate (e.g. u_{i-} or v_{j-}), which corresponds to the texture coordinate rounded down to an integer texel position; and (ii) a second integer texel coordinate (e.g. u_{i+} or v_{j+}), which is one greater than the first integer texel coordinate.

The fractional part of the texel coordinates may be passed to TF 216 for texture filtering. Fig. 8 shows integer texel coordinates 802 determined for a reduced set of texture coordinates in an example of applying bilinear filtering.

Step S702 is performed independently for each of the texture coordinates (U₀, U₁, U₂, U₃, V₀, V₁, V₂ and V₃) and before pairing the coordinates for generating texel addresses. It should be noted that for each of the fragments of the fragment block, four pairs of integer texel coordinates correspond to the four texel addresses of the four texels to be extracted for performing bilinear interpolation for that fragment. For example, as shown in fig. 8, the four pairs of integer texel coordinates (u₀₋, v₀₋; u₀₊, v₀₋; u₀₋, v₀₊; u₀₊, v₀₊) in the box denoted 804₀ correspond to the four texel addresses of the four texels to be extracted for performing bilinear interpolation for the upper left fragment of the 4x4 fragment block. As another example, as shown in fig. 8, the four pairs of integer texel coordinates (u₁₋, v₁₋; u₁₊, v₁₋; u₁₋, v₁₊; u₁₊, v₁₊) in the box denoted 804₁ correspond to the four texel addresses of the four texels to be extracted for performing bilinear interpolation for the fragment in the second row and second column of the 4x4 fragment block. As another example, as shown in fig. 8, the four pairs of integer texel coordinates (u₂₋, v₂₋; u₂₊, v₂₋; u₂₋, v₂₊; u₂₊, v₂₊) in the box denoted 804₂ correspond to the four texel addresses of the four texels to be extracted for performing bilinear interpolation for the fragment in the third row and third column of the 4x4 fragment block. As another example, as shown in fig. 8, the four pairs of integer texel coordinates (u₃₋, v₃₋; u₃₊, v₃₋; u₃₋, v₃₊; u₃₊, v₃₊) in the box denoted 804₃ correspond to the four texel addresses of the four texels to be extracted for performing bilinear interpolation for the lower right fragment of the 4x4 fragment block.
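By way of illustration only, the four pairs of integer texel coordinates used for one fragment can be written out in Python from the two rounded-down coordinates, with each rounded-up coordinate taken as the rounded-down coordinate plus one. The function name and the order in which the pairs are returned are hypothetical choices for this sketch.

```python
def bilinear_pairs(u_low, v_low):
    """Four (u, v) pairs needed for bilinear filtering of one fragment.

    u_low and v_low are the rounded-down integer texel coordinates; the
    rounded-up coordinates are taken to be one greater.
    """
    u_high, v_high = u_low + 1, v_low + 1
    return [(u_low, v_low), (u_high, v_low), (u_low, v_high), (u_high, v_high)]
```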

As shown in fig. 7, when bilinear filtering is implemented, step S312 (of generating a texel address of a texel to be extracted using the determined integer texel coordinates) is performed by performing steps S704 and S706.

When the texture coordinates are axis aligned, some of the four texels used for bilinear filtering for one fragment may be the same as some of the texels used for bilinear filtering for one or more other fragments in the fragment block. When texels are shared in this way, the address of the texel may be generated once (and the texel may be extracted once) instead of multiple times. This may result in faster bilinear filtering without adding more address generators. It should be noted that adding more address generators would increase the silicon area and power consumption of TAG 212 in TPU 208.

Thus, in step S704, the uniqueness logic 606 performs a uniqueness process on the determined integer texel coordinates to remove one or more duplicate integer texel coordinates and thereby determine a subset of the determined integer texel coordinates. It should be understood that the term "subset" is used herein to mean a "proper subset", i.e. fewer than all of the integer texel coordinates determined in step S702 are included in the subset of integer texel coordinates determined in step S704. The determined subset of integer texel coordinates is provided from the uniqueness logic 606 to the address generators 608. For an m×n fragment block, where each fragment is associated with a texture coordinate for the horizontal dimension and a texture coordinate for the vertical dimension of the 2D texture, the determined subset of integer texel coordinates may include n+1 integer texel coordinates for the horizontal dimension and m+1 integer texel coordinates for the vertical dimension.

In step S706, the address generator 608 generates a texel address of the texel to be extracted using the determined subset of integer texel coordinates.

It should be noted that the uniqueness process is performed on the integer texel coordinates before the texel addresses are generated, such that if the pair of integer texel coordinates for a first fragment of the block is the same as the pair of integer texel coordinates for a second fragment of the block, the texel address corresponding to that pair is generated only once for processing the fragment block. For example, the uniqueness process performed in step S704 may make all of the texel addresses generated in step S706 for processing the fragment block unique.
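The effect of steps S704 and S706 can be illustrated with a minimal software sketch. It is not the hardware design of the uniqueness logic 606: the function names, the concrete coordinate values and the first-occurrence de-duplication rule are all assumptions made for illustration. The sketch removes duplicates per dimension and only then pairs the coordinates, mirroring the order described above.

```python
from itertools import product

def unique_in_order(coords):
    """Drop repeated integer texel coordinates, keeping first occurrences."""
    seen, kept = set(), []
    for c in coords:
        if c not in seen:
            seen.add(c)
            kept.append(c)
    return kept

def texel_coordinate_pairs(u_coords, v_coords):
    """Pair every kept horizontal coordinate with every kept vertical one."""
    return list(product(unique_in_order(u_coords), unique_in_order(v_coords)))

# Hypothetical 4x4 block with a 1:1 mapping: column i uses u- = 10 + i and
# u+ = 11 + i; row j uses v- = 20 + j and v+ = 21 + j.
u = [c for i in range(4) for c in (10 + i, 11 + i)]  # 8 values, 5 distinct
v = [c for j in range(4) for c in (20 + j, 21 + j)]  # 8 values, 5 distinct

without_uniqueness = list(product(u, v))       # 8 x 8 coordinate pairs
with_uniqueness = texel_coordinate_pairs(u, v)  # 5 x 5 coordinate pairs
```

Because duplicates are removed from each dimension before pairing, every resulting coordinate pair (and hence every texel address) is distinct.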

The method then proceeds to step S314, where the texels are extracted using the generated texel address, as described above.

In step S704, the uniqueness process may be performed on the determined integer texel coordinates by the uniqueness logic 606 in response to determining that there is a sufficient number of repetitions among the determined integer texel coordinates. If the uniqueness logic 606 is not able to remove a sufficient number of integer texel coordinates (e.g. if there are not enough duplicate integer texel coordinates), the uniqueness logic 606 may provide all of the integer texel coordinates determined in step S702 to the address generators 608. For example, the determined subset of integer texel coordinates may correspond to N texel addresses to be generated, and if N is less than or equal to the number of address generators 608 (e.g. there may be 32 address generators), the determined subset of integer texel coordinates may be provided to the address generators 608 in step S704, such that the address generators 608 are able to generate the texel addresses for the texels to be extracted within a single clock cycle. In contrast, if N is greater than the number of address generators 608, all of the integer texel coordinates determined in step S702 may be provided to the address generators 608 in step S704, and the address generators 608 may generate the texel addresses for the texels to be extracted over a plurality of clock cycles (e.g. over 2 clock cycles).
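The decision described in the preceding paragraph can be sketched as follows. The function name and the use of a ceiling division for the multi-cycle case are assumptions; the text only states that the fallback takes a plurality of clock cycles (e.g. 2).

```python
import math

def plan_address_generation(num_all, num_unique, num_generators=32):
    """Return (use_subset, cycles) for one fragment block.

    num_all:    texel addresses needed if no duplicates are removed
    num_unique: texel addresses needed after the uniqueness process (N)
    """
    if num_unique <= num_generators:
        # The subset fits in the available address generators: one cycle.
        return True, 1
    # Not enough duplicates: fall back to all coordinates, several cycles.
    return False, math.ceil(num_all / num_generators)
```

For instance, `plan_address_generation(64, 25)` selects the subset and a single cycle, whereas `plan_address_generation(64, 36)` falls back to all coordinates over two cycles.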

Fig. 9a shows an example of an 8x8 block of pairs of integer texel coordinates 902 that would result from applying bilinear filtering to a 4x4 fragment block if the uniqueness process were not performed. In particular, for a fragment at position $(i, j)$ in the fragment block, the four pairs of integer texel coordinates are denoted $(u_{i-}, v_{j-})$, $(u_{i+}, v_{j-})$, $(u_{i-}, v_{j+})$ and $(u_{i+}, v_{j+})$. Opportunities for uniqueness arise when some of the pairs of integer texel coordinates within block 902 are the same.

Typically (e.g., when using TPU 208 for post-processing or for rendering a 2D scene such as a graphical user interface), the level of detail of the texture filtering corresponds to a 1:1 mapping between the pitch of the fragments in the fragment block and the pitch of the texels in the texture. When this 1:1 mapping exists between the fragment block and the texture, the uniqueness process may be used so that, instead of the 8x8 block of pairs of integer texel coordinates shown in fig. 9a, there is a 5x5 block of pairs of integer texel coordinates as shown in fig. 9b. That is, fig. 9b shows a 5x5 block of pairs of integer texel coordinates 904 that results from applying bilinear filtering to a 4x4 fragment block when the uniqueness process is performed.

In this example, with a 1:1 mapping, the relations $u_{i+} = u_{(i+1)-}$ and $v_{j+} = v_{(j+1)-}$ are exploited. More specifically, $u_{0+} = u_{1-}$, $u_{1+} = u_{2-}$, $u_{2+} = u_{3-}$, $v_{0+} = v_{1-}$, $v_{1+} = v_{2-}$ and $v_{2+} = v_{3-}$. This means that every pair of integer texel coordinates shown with cross-hatching in fig. 9a is a copy of one of the pairs of integer texel coordinates not shown with cross-hatching in fig. 9a. Thus, the pairs of integer texel coordinates not shown with cross-hatching in fig. 9a are present in fig. 9b, whereas the pairs shown with cross-hatching in fig. 9a are absent from fig. 9b.

It can be appreciated that, in the example shown in figs. 9a and 9b, TAG 212 would generate 64 texel addresses (corresponding to the 64 pairs of integer texel coordinates shown in fig. 9a) if uniqueness were not performed, but generates only 25 texel addresses (corresponding to the 25 pairs of integer texel coordinates shown in fig. 9b) when uniqueness is performed. In the above example in which TAG 212 includes 32 address generators 608, it would take two clock cycles to generate all of the texel addresses without uniqueness (because 64 = 2 × 32), but only one clock cycle with uniqueness (because 25 < 32). Thus, uniqueness can double the rate at which texel addresses are generated without changing the number of address generators in TAG 212 (i.e. without increasing silicon area or power consumption).

In the example shown in fig. 9b, with a 1:1 mapping, a 5x5 block of pairs of integer texel coordinates (corresponding to the texel addresses generated for the texels to be extracted) is determined for applying bilinear filtering to the 4x4 fragment block. In general, with a 1:1 mapping, to apply bilinear filtering to an m×n fragment block, an (m+1)×(n+1) block of pairs of integer texel coordinates (corresponding to the texel addresses generated for the texels to be extracted) may be determined. This is in contrast to a (2m)×(2n) block of pairs of integer texel coordinates being determined when uniqueness is not performed in TAG 212.
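As a worked check of these block sizes (the limit in the last line is an observation that follows from the two expressions rather than a statement from the text), the number of texel addresses without and with uniqueness for an m×n fragment block under a 1:1 mapping is:

$$N_{\text{without}} = (2m)(2n) = 4mn, \qquad N_{\text{with}} = (m+1)(n+1)$$

$$m = n = 4: \quad N_{\text{without}} = 64, \quad N_{\text{with}} = 25$$

$$\frac{N_{\text{with}}}{N_{\text{without}}} = \frac{(m+1)(n+1)}{4mn} \to \frac{1}{4} \quad \text{as } m, n \to \infty$$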

Figs. 10a to 10c show another example. In particular, fig. 10a shows a 2x4 fragment block 1002. Fig. 10b shows an example of a 4x8 block of pairs of integer texel coordinates 1004 that would result from applying bilinear filtering to the 2x4 fragment block if the uniqueness process were not performed. Fig. 10c shows a 3x5 block of pairs of integer texel coordinates 1006 that results from applying bilinear filtering to the 2x4 fragment block when the uniqueness process is performed using the 1:1 mapping described above. Every pair of integer texel coordinates shown with cross-hatching in fig. 10b is a copy of one of the pairs of integer texel coordinates not shown with cross-hatching in fig. 10b. Thus, the pairs of integer texel coordinates not shown with cross-hatching in fig. 10b are present in fig. 10c, whereas the pairs shown with cross-hatching in fig. 10b are absent from fig. 10c.

It can be appreciated that, in the further example shown in figs. 10a to 10c, TAG 212 would generate 32 texel addresses (corresponding to the 32 pairs of integer texel coordinates shown in fig. 10b) if uniqueness were not performed, but generates only 15 texel addresses (corresponding to the 15 pairs of integer texel coordinates shown in fig. 10c) when uniqueness is performed. In some implementations, TAG 212 may include 16 address generators 608, and in these implementations it would take two clock cycles to generate all of the texel addresses without uniqueness (because 32 = 2 × 16), but only one clock cycle with uniqueness (because 15 < 16). Thus, in this further example too, uniqueness can double the rate at which texel addresses are generated without changing the number of address generators in TAG 212 (i.e. without increasing silicon area or power consumption).

Figs. 10a to 10c illustrate that, when the uniqueness process is performed and the 1:1 mapping described above is exploited, a 3x5 block of pairs of integer texel coordinates may be determined when bilinear filtering is applied to a 2x4 fragment block. In a similar example (not shown in the figures), it will be appreciated that, when the uniqueness process is performed and the 1:1 mapping described above is exploited, a 5x3 block of pairs of integer texel coordinates may be determined when bilinear filtering is applied to a 4x2 fragment block.

The uniqueness logic 606 sends an indication to the TF 216 indicating whether it has performed a uniqueness process on the integer texel coordinates.

Returning to fig. 3, when bilinear filtering is implemented, MADD 214 extracts the texels using the generated texel addresses as described above and passes the texels to TF 216 in step S314.

In step S316, when bilinear filtering is implemented, TF 216 determines a filtered value for each of the fragments of the block by applying filtering to a subset of the extracted texels. In particular, in step S316, TF 216 performs a de-uniqueness process on the extracted texels to determine which of the extracted texels are included in the subset for each of the fragments of the block, i.e. which of the extracted texels are to be included in the bilinear interpolation for each fragment. Performing bilinear interpolation for a fragment uses four of the extracted texels, so the subset for a fragment consists of those four extracted texels. For each of the fragments of the block, the four pairs of integer texel coordinates correspond to the four texel addresses of the four extracted texels of the subset. To perform the de-uniqueness, TF 216 uses the sideband data it receives from the uniqueness logic 606 of TAG 212, which indicates how the uniqueness process was applied to the data in TAG 212. In this way, the uniqueness performed by the uniqueness logic 606 can be reversed by the de-uniqueness performed by TF 216, thereby determining the subset of the extracted texels to be used in the bilinear interpolation for each of the fragments.
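A minimal sketch of the de-uniqueness idea, under stated assumptions: the extracted texels are held as a 2D grid indexed by the de-duplicated vertical and horizontal coordinates, and the hypothetical tables `u_index` and `v_index` stand in for the sideband data that records where each fragment's "-" and "+" coordinates ended up in the de-duplicated lists. The actual sideband encoding is not specified at this level of detail.

```python
def gather_fragment_texels(texels, u_index, v_index, i, j):
    """Return the texels (a, b, c, d) for the fragment in column i, row j.

    texels[r][c] is the extracted texel at de-duplicated row r, column c.
    u_index[i] = (position of u_i-, position of u_i+) in the unique u list.
    v_index[j] = (position of v_j-, position of v_j+) in the unique v list.
    """
    u_lo, u_hi = u_index[i]
    v_lo, v_hi = v_index[j]
    return (texels[v_lo][u_lo], texels[v_lo][u_hi],
            texels[v_hi][u_lo], texels[v_hi][u_hi])

# Hypothetical 5x5 grid of extracted texels for a 4x4 block, 1:1 mapping.
texels = [[row * 5 + col for col in range(5)] for row in range(5)]
u_index = [(i, i + 1) for i in range(4)]  # u_i+ shares a slot with u_(i+1)-
v_index = [(j, j + 1) for j in range(4)]

top_left = gather_fragment_texels(texels, u_index, v_index, 0, 0)
bottom_right = gather_fragment_texels(texels, u_index, v_index, 3, 3)
```

Each of the 16 fragments recovers its own four texels from the shared 25, so neighbouring fragments reuse texels that were extracted only once.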

As described above, when TAG 212 converts texture coordinates (e.g., U and V) to texel coordinates (e.g., u and v), the fractional portions of the texel coordinates are sent as sideband data from TAG 212 to TF 216.

TF 216 may determine bilinear filtered values for a particular fragment by using four texels (a, b, c, d) that have been extracted for that fragment. These four texels represent a texel quadrilateral surrounding the texel coordinates determined for the segment, e.g. with texel a at the top left of the quadrilateral, texel b at the top right of the quadrilateral, texel c at the bottom left of the quadrilateral, and texel d at the bottom right of the quadrilateral. The TF 216 may determine the bilinear filtered value (F) by: first, horizontal interpolation is performed such that:

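$$\alpha = a\,(1 - u_{\mathrm{coeff}}) + b\,u_{\mathrm{coeff}} \quad \text{and} \quad \beta = c\,(1 - u_{\mathrm{coeff}}) + d\,u_{\mathrm{coeff}}$$

And then vertical interpolation is performed such that:

$$F = \alpha\,(1 - v_{\mathrm{coeff}}) + \beta\,v_{\mathrm{coeff}}$$

where $u_{\mathrm{coeff}}$ is the horizontal interpolation weight and $v_{\mathrm{coeff}}$ is the vertical interpolation weight.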

When TF 216 performs bilinear interpolation on a subset of four extracted texels, the horizontal interpolation weight ($u_{\mathrm{coeff}}$) for the bilinear interpolation for the fragment is based on (e.g., may be equal to) the determined fractional portion of the texel position corresponding to the texture coordinate associated with the fragment for the horizontal dimension, and the vertical interpolation weight ($v_{\mathrm{coeff}}$) for the bilinear interpolation for the fragment is based on (e.g., may be equal to) the determined fractional portion of the texel position corresponding to the texture coordinate associated with the fragment for the vertical dimension.
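The two interpolation stages described above translate directly into code. The following is an illustrative software sketch only (the function name is an assumption), using scalar texel values for simplicity; a real texel would typically have several colour channels.

```python
def bilinear(a, b, c, d, u_coeff, v_coeff):
    """Bilinear filter of a texel quad: a top-left, b top-right,
    c bottom-left, d bottom-right."""
    alpha = a * (1 - u_coeff) + b * u_coeff  # horizontal, top row
    beta = c * (1 - u_coeff) + d * u_coeff   # horizontal, bottom row
    return alpha * (1 - v_coeff) + beta * v_coeff  # vertical
```

For example, `bilinear(0, 10, 20, 30, 0.5, 0.5)` gives 15.0, the average of the four texels, and with both weights equal to zero the result is simply texel a.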

Prior to generating the texel addresses in TAG 212, the texture-to-texel conversion logic 604 may detect that the determined fractional portion of the texel position corresponding to a texture coordinate is zero. In response to detecting that the determined fractional portion of the texel position corresponding to the texture coordinate is zero, TAG 212 may determine that two of the four texels used in the bilinear interpolation for a fragment associated with that texture coordinate are not needed to determine the result of the bilinear interpolation. Texel addresses therefore need not be generated for the texels that are not needed, and those texels need not be extracted. This is a further optimization that may reduce the power consumption of TPU 208. For example, if $u_{\mathrm{coeff}} = 0$, then $F = a\,(1 - v_{\mathrm{coeff}}) + c\,v_{\mathrm{coeff}}$, and F does not depend on texels b or d, so there is no need to generate texel addresses for texels b and d, and no need to extract texels b and d, in order to determine the filtered value F. Similarly, as another example, if $v_{\mathrm{coeff}} = 0$, then $F = a\,(1 - u_{\mathrm{coeff}}) + b\,u_{\mathrm{coeff}}$, and F does not depend on texels c or d, so there is no need to generate texel addresses for texels c and d, and no need to extract texels c and d, in order to determine the filtered value F. TAG 212 may send two indications to TF 216 as sideband data, indicating whether $u_{\mathrm{coeff}} = 0$ and whether $v_{\mathrm{coeff}} = 0$.
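The zero-fraction optimization can be sketched as a small helper that reports which of the texels a, b, c and d still need to be addressed and extracted. The function name and the use of a set of labels are illustrative assumptions; the case where both weights are zero is not spelled out in the text but follows from the same formulas (F reduces to a).

```python
def needed_texels(u_coeff, v_coeff):
    """Labels of the texels that the bilinear result actually depends on."""
    needed = {"a", "b", "c", "d"}
    if u_coeff == 0:
        needed -= {"b", "d"}  # right-hand column has zero weight
    if v_coeff == 0:
        needed -= {"c", "d"}  # bottom row has zero weight
    return needed
```

With `u_coeff == 0` only texels a and c are required, and with `v_coeff == 0` only a and b.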

In the examples given above, there is a 1:1 mapping between the pitch of the fragments in the fragment block and the pitch of the texels in the texture. In other examples, there may be a different mapping between the pitch of the fragments in the fragment block and the pitch of the texels in the texture. When the mapping (i.e. the 'scaling') is not 1:1, the texture coordinates (U and V) change more slowly or more quickly than the fragment coordinates (X and Y). For example, under magnification the texture coordinates (U and V) change more slowly than the fragment coordinates (X and Y), and in the best case uniqueness may remove even more duplicates than with the 1:1 mapping described above. For example, if U changes more slowly than X, the rounded-down integer texel coordinate $u_{0-}$ may be the same as the rounded-down integer texel coordinate $u_{1-}$ (such that $u_{0-} = u_{1-}$ and $u_{0+} = u_{1+}$). Similarly, if V changes more slowly than Y, the rounded-down integer texel coordinate $v_{0-}$ may be the same as the rounded-down integer texel coordinate $v_{1-}$ (such that $v_{0-} = v_{1-}$ and $v_{0+} = v_{1+}$).
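The magnification case can be illustrated numerically. The sketch below assumes that the lower integer texel coordinate is the floor of the texel position and that the upper one is the lower plus one; the exact rounding convention (e.g. any half-texel offset) and the sample positions are assumptions for illustration.

```python
import math

def integer_coords(texel_positions):
    """(lower, upper) integer texel coordinates for each texel position.
    Assumption: lower = floor(position), upper = lower + 1."""
    return [(math.floor(p), math.floor(p) + 1) for p in texel_positions]

def distinct(pairs):
    """Sorted distinct integer coordinates used across all fragments."""
    return sorted({c for pair in pairs for c in pair})

# Four fragments in a row: texel positions step by 1.0 (1:1 mapping)
# or by 0.5 (2x magnification, so U changes more slowly than X).
one_to_one = integer_coords([10.25, 11.25, 12.25, 13.25])
magnified = integer_coords([10.25, 10.75, 11.25, 11.75])
```

Under the 1:1 mapping the four fragments need five distinct horizontal coordinates, whereas under this 2x magnification the first two fragments share both of their coordinates and only three distinct coordinates remain.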

When applying bilinear filtering to a 4x4 fragment block such that the texture-to-texel conversion logic 604 determines the 16 integer texel coordinates shown in fig. 9a (i.e. $u_{0-}$, $u_{0+}$, $u_{1-}$, $u_{1+}$, $u_{2-}$, $u_{2+}$, $u_{3-}$, $u_{3+}$, $v_{0-}$, $v_{0+}$, $v_{1-}$, $v_{1+}$, $v_{2-}$, $v_{2+}$, $v_{3-}$ and $v_{3+}$), the uniqueness logic 606 may determine that all of the base integer texel coordinates (i.e. the rounded-down texel coordinates $u_{0-}$, $u_{1-}$, $u_{2-}$, $u_{3-}$, $v_{0-}$, $v_{1-}$, $v_{2-}$ and $v_{3-}$) and the two final integer texel coordinates in the 8x8 block (i.e. $u_{3+}$ and $v_{3+}$) will be included in the subset of integer texel coordinates. Each "+" integer texel coordinate will not be equal to the "-" integer texel coordinate for the same fragment, but may be equal to one of the "-" integer texel coordinates for another fragment. The uniqueness logic 606 may then determine whether each of the remaining integer texel coordinates (i.e. $u_{0+}$, $u_{1+}$, $u_{2+}$, $v_{0+}$, $v_{1+}$ and $v_{2+}$) is the same as at least one of the integer texel coordinates included in the subset 904, and if so, may perform the uniqueness process. In particular, the uniqueness logic 606 may determine that there is a sufficient number of repetitions among the determined integer texel coordinates for the uniqueness process to be performed by determining whether all six of the following expressions are satisfied:

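$$(u_{0+} = u_{1-}) \lor (u_{0+} = u_{2-}) \lor (u_{0+} = u_{3-}) \lor (u_{0+} = u_{3+})$$

$$(u_{1+} = u_{0-}) \lor (u_{1+} = u_{2-}) \lor (u_{1+} = u_{3-}) \lor (u_{1+} = u_{3+})$$

$$(u_{2+} = u_{0-}) \lor (u_{2+} = u_{1-}) \lor (u_{2+} = u_{3-}) \lor (u_{2+} = u_{3+})$$

$$(v_{0+} = v_{1-}) \lor (v_{0+} = v_{2-}) \lor (v_{0+} = v_{3-}) \lor (v_{0+} = v_{3+})$$

$$(v_{1+} = v_{0-}) \lor (v_{1+} = v_{2-}) \lor (v_{1+} = v_{3-}) \lor (v_{1+} = v_{3+})$$

$$(v_{2+} = v_{0-}) \lor (v_{2+} = v_{1-}) \lor (v_{2+} = v_{3-}) \lor (v_{2+} = v_{3+})$$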

where $u_{i-}$ and $u_{i+}$ are the two integer texel coordinates in the horizontal dimension for each of the fragments in the i-th column of the fragment block, where $i \in \{0, 1, 2, 3\}$; where $v_{j-}$ and $v_{j+}$ are the two integer texel coordinates in the vertical dimension for each of the fragments in the j-th row of the fragment block, where $j \in \{0, 1, 2, 3\}$; and where $\lor$ represents a logical OR operation.

An indication (e.g., a 1-bit indication) may be provided from TAG 212 to TF 216 in the sideband data indicating whether all six of the tests given in the previous paragraph are met, so that TF 216 knows whether a uniqueness process has been performed (and thus knows whether TF 216 needs to perform a de-uniqueness process). Further, for each of the six tests given above, an indication (e.g., a 2-bit indication) may be provided in the sideband data to the TF 216 indicating which of the four equations in the test is satisfied, such that the TF 216 knows how to perform the de-uniqueness process.
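The six tests and the accompanying sideband indications can be sketched in software as follows. This is an illustrative model of the non-flipped form of the tests only: the function names, the choice of the first matching equation as the 2-bit code, and the list-based sideband representation are all assumptions.

```python
def match_index(value, candidates):
    """Index (0..3, i.e. a 2-bit code) of the first candidate equal to
    value, or None if none of the four equations is satisfied."""
    for k, cand in enumerate(candidates):
        if value == cand:
            return k
    return None

def axis_tests(lo, hi):
    """lo = [c0-, c1-, c2-, c3-], hi = [c0+, c1+, c2+, c3+] for one
    dimension. Returns three 2-bit codes, or None if any test fails."""
    codes = []
    for i in range(3):
        # Compare c_i+ with every "-" coordinate of the other fragments,
        # then with the final "+" coordinate c3+.
        candidates = [lo[k] for k in range(4) if k != i] + [hi[3]]
        code = match_index(hi[i], candidates)
        if code is None:
            return None
        codes.append(code)
    return codes

def uniqueness_sideband(u_lo, u_hi, v_lo, v_hi):
    """Return (1-bit 'uniqueness performed' flag, six 2-bit codes or None)."""
    u_codes, v_codes = axis_tests(u_lo, u_hi), axis_tests(v_lo, v_hi)
    performed = u_codes is not None and v_codes is not None
    return performed, (u_codes + v_codes if performed else None)
```

With a 1:1 mapping, e.g. `u_lo = [10, 11, 12, 13]` and `u_hi = [11, 12, 13, 14]`, the codes for the horizontal dimension are `[0, 1, 2]`, meaning the first, second and third equations of the respective tests are the ones satisfied.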

In some examples, one or both of the horizontal and vertical dimensions of the texture may be flipped relative to the dimensions of the fragment block. For example, the application may flip the texture vertically such that V decreases as Y increases, and/or the application may flip the texture horizontally such that U decreases as X increases. When one or both of the dimensions are flipped, the integer texel coordinates that may be equal to one another change, as demonstrated in the examples below.

For axis-aligned texturing of a 4x4 fragment block with 1:1 sampling and with no flip in either the horizontal or the vertical dimension, $u_{0+} = u_{1-}$, $u_{1+} = u_{2-}$ and $u_{2+} = u_{3-}$, and $v_{0+} = v_{1-}$, $v_{1+} = v_{2-}$ and $v_{2+} = v_{3-}$. The "yes" and "no" indications in the following table indicate which texel addresses are generated in this case: 25 of the possible 64 texel addresses are generated. It should be noted that this table corresponds to fig. 9a.

In the same case, but with the horizontal dimension flipped (and the vertical dimension not flipped), $u_{3+} = u_{2-}$, $u_{2+} = u_{1-}$ and $u_{1+} = u_{0-}$, and $v_{0+} = v_{1-}$, $v_{1+} = v_{2-}$ and $v_{2+} = v_{3-}$. The "yes" and "no" indications in the following table indicate which texel addresses are generated in this case, and again 25 of the possible 64 texel addresses are generated (note that the column headers in the following table differ from the column headers in the table above):

In the same case, but with the vertical dimension flipped (and the horizontal dimension not flipped), $u_{0+} = u_{1-}$, $u_{1+} = u_{2-}$ and $u_{2+} = u_{3-}$, and $v_{3+} = v_{2-}$, $v_{2+} = v_{1-}$ and $v_{1+} = v_{0-}$. The "yes" and "no" indications in the following table indicate which texel addresses are generated in this case, and again 25 of the possible 64 texel addresses are generated:

In the same case, but with both the horizontal and vertical dimensions flipped, $u_{3+} = u_{2-}$, $u_{2+} = u_{1-}$ and $u_{1+} = u_{0-}$, and $v_{3+} = v_{2-}$, $v_{2+} = v_{1-}$ and $v_{1+} = v_{0-}$. The "yes" and "no" indications in the following table indicate which texel addresses are generated in this case, and again 25 of the possible 64 texel addresses are generated:
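The four flip cases can be checked with a short sketch. It assumes concrete coordinate values and one particular selection rule (keep every "-" coordinate, plus any "+" coordinate that is not a copy of a "-" coordinate), which reproduces the equalities listed above; the function names and base values are hypothetical.

```python
def kept_labels(lo, hi, name):
    """Labels of the coordinates kept for one dimension: every '-'
    coordinate, plus any '+' coordinate that duplicates no '-' one."""
    kept = [f"{name}{i}-" for i in range(4)]
    kept += [f"{name}{i}+" for i in range(4) if hi[i] not in lo]
    return kept

def generated_pairs(flip_u, flip_v, base_u=10, base_v=20):
    """Coordinate-pair labels for which a texel address is generated
    (4x4 block, 1:1 sampling), for a given flip configuration."""
    step_u, step_v = (-1 if flip_u else 1), (-1 if flip_v else 1)
    u_lo = [base_u + step_u * i for i in range(4)]
    v_lo = [base_v + step_v * j for j in range(4)]
    u_hi = [c + 1 for c in u_lo]
    v_hi = [c + 1 for c in v_lo]
    return [(u, v)
            for v in kept_labels(v_lo, v_hi, "v")
            for u in kept_labels(u_lo, u_hi, "u")]
```

In every one of the four flip configurations `generated_pairs` returns 25 pairs; without a flip the surviving "+" coordinate in a dimension is the last one (e.g. `u3+`), whereas with a flip it is the first one (e.g. `u0+`).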

In the examples given above, the texture filtering applied by TF 216 is bilinear filtering, two integer texel coordinates are determined for each of the texture coordinates by the texture-to-texel conversion logic 604 of TAG 212, and for each of the fragments of the block a filtered value is determined by determining the result of a bilinear interpolation of four of the extracted texels. More generally, the texture filtering applied by TF 216 may be two-dimensional polynomial filtering using a polynomial of degree d, where $d \geq 1$, in which (d+1) integer texel coordinates are determined for each of the texture coordinates by the texture-to-texel conversion logic 604 of TAG 212. In this general case of polynomial filtering, for each of the fragments of the block, the filtered value is determined by determining the result of a two-dimensional polynomial interpolation of $(d+1)^2$ of the extracted texels, where the polynomial interpolation uses a polynomial of degree d. It should be noted that bilinear filtering is two-dimensional polynomial filtering using a polynomial of degree d = 1. As another example, bicubic filtering using a polynomial of degree 3 may be implemented, in which, for each of the fragments of the block, a filtered value is determined by determining the result of a bicubic interpolation of a 4x4 block of the extracted texels.
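To give a feel for how the savings scale with the polynomial degree, the following sketch counts texel addresses with and without uniqueness under a 1:1 mapping. It assumes that each fragment uses (d+1) contiguous integer texel coordinates per dimension and that neighbouring fragments are offset by exactly one texel; the particular footprint offsets and the function name are assumptions, and the bicubic figures are derived from this model rather than stated in the text.

```python
def address_counts(m, n, d):
    """(without uniqueness, with uniqueness) texel-address counts for an
    m x n fragment block, 1:1 mapping, polynomial degree d."""
    def axis_coords(num_fragments):
        coords = []
        for i in range(num_fragments):
            # (d + 1) contiguous coordinates around fragment i,
            # e.g. i..i+1 for d = 1 and i-1..i+2 for d = 3.
            coords += [i + off for off in range(-(d // 2), d - d // 2 + 1)]
        return coords

    u, v = axis_coords(n), axis_coords(m)
    return len(u) * len(v), len(set(u)) * len(set(v))
```

For bilinear filtering this reproduces the figures given earlier, `address_counts(4, 4, 1) == (64, 25)` and `address_counts(2, 4, 1) == (32, 15)`; under the same assumptions, bicubic filtering of a 4x4 block gives `(256, 49)`.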

Returning to the flowchart shown in fig. 3, if the fragment processing unit 206 determines in step S304 that the texture coordinates for the fragments of the fragment block are not axis aligned, the method passes from step S304 to step S320. In step S320, in response to detecting that the texture coordinates for the fragments of the block are not axis aligned, the fragment processing unit 206 sends a non-reduced texture coordinate set to the texture processing unit 208. For example, the non-reduced texture coordinate set may include a U value and a V value for each of the fragments of the block. In this case, the method proceeds to step S308 as described above, in which the texture processing unit 208 processes the non-reduced texture coordinate set to generate texel addresses of the texels to be extracted. The method may then continue as described above.

It should be noted that the method shown in fig. 3 may be performed for a plurality of fragment blocks (e.g., where each fragment block (e.g., each 4x4 block) corresponds to a block of pixels within a larger image), and it may be the case that some of the blocks have axis-aligned texture coordinates while other blocks do not. Each of the blocks may be processed separately, such that those blocks having axis-aligned texture coordinates are processed so as to achieve the benefits described herein (e.g., using a reduced texture coordinate set and/or using the uniqueness process in TAG 212), while those blocks not having axis-aligned texture coordinates are processed without those benefits (e.g., using a complete texture coordinate set and without using the uniqueness process in TAG 212).

The examples described above have very significant benefits in terms of performance (e.g. latency), power consumption and/or silicon area of GPU 202. In those examples, in response to detecting that the texture coordinates for the fragments of a fragment block are axis aligned, one or both of the following is performed: (i) the texture coordinate set is reduced (e.g., from 16 U values and 16 V values to only 4 U values and 4 V values), and (ii) a uniqueness process is performed on the integer texel coordinates to reduce the number of texel addresses generated. For example, improvements in PPA (power, performance, area) factors on the order of 10% can be achieved for applications that perform post-processing using axis-aligned textures.

In the examples detailed above, blocks of fragments are processed. In general, a "fragment" may be a data element, such as an image data element (such as a primitive fragment or a pixel) or a non-image data element, for example when processing a compute workload.

In some examples, the TPU may be implemented within the GPU for applying post-processing to the output of the camera pipeline.

Furthermore, the examples described above implement a TPU within a GPU, but it would also be possible to implement the methods described above in a processor other than a GPU. In general, GPU 202 described above is an example of a processor, fragment processing unit 206 is an example of a data processing unit, texture processing unit 208 is an example of a data loading unit, a fragment block is an example of a block of data items, and a texture is an example of a stored data array, wherein the texels of the texture are examples of data array elements of the data array. For example, the techniques described herein may be used in a parallel processor that is not necessarily a GPU, e.g., for processing data that is not necessarily graphics data. For example, the techniques may be implemented in a Single Instruction Multiple Data (SIMD) processor having some dedicated data loading function (similar to the function described above with respect to texture processing unit 208) for reading a data array. To name a few additional examples, the techniques described herein may be used to: (i) read a matrix for a compute operation, (ii) read weights (or a weight matrix) for a neural network, (iii) read an array of scientific data from a sensor, and (iv) read neural network data. To give some further examples, the techniques described herein may be used for any of the following: (i) processing matrices for linear algebra, engineering/scientific calculations, physical simulation, fluid flow, molecular modeling, and weather forecasting; (ii) data transformations, such as the Fast Fourier Transform (FFT), encoding, and encryption; (iii) data searching/classification/filtering and graph-based methods; and (iv) neural networks, AI, speech analysis, and language models.

In these more general examples, a processor may be operative to retrieve a block of data items, wherein each of the data items is associated with coordinates for each of a plurality of dimensions of the stored data array. The data processing unit of the processor detects that the coordinates associated with the data items of the block are axis aligned. In response to detecting that the coordinates of the data items for the block are axis aligned, the following are sent to the data loading unit: (i) Only one coordinate for a first dimension of each row of data items aligned in a first dimension within the block, and (ii) only one coordinate for a second dimension of each row of data items aligned in a second dimension within the block, the second dimension being orthogonal to the first dimension. The data loading unit processes the coordinates to generate an address of a data array element to be extracted from the stored data array. The data loading unit then uses the generated address to extract the data array element from the stored data array. The data loading unit determines a data item value for each of the data items of the block based on the extracted data array elements. The data loading unit may then output the data item value.
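A minimal sketch of the detection and reduction described in this paragraph, for a 2D block. It assumes that axis alignment means the first-dimension coordinate depends only on the column and the second-dimension coordinate depends only on the row, and that this is tested by exact equality; the function name and the nested-list layout are hypothetical.

```python
def reduce_if_axis_aligned(U, V):
    """U[j][i] and V[j][i] are the coordinates of the item in row j and
    column i. Returns (one U per column, one V per row) when the block is
    axis aligned, otherwise None (the full set would be sent instead)."""
    rows, cols = len(U), len(U[0])
    u_ok = all(U[j][i] == U[0][i] for j in range(rows) for i in range(cols))
    v_ok = all(V[j][i] == V[j][0] for j in range(rows) for i in range(cols))
    if not (u_ok and v_ok):
        return None
    return list(U[0]), [V[j][0] for j in range(rows)]

# Hypothetical axis-aligned 4x4 block: 16 U and 16 V values reduce to 4 + 4.
U = [[i * 0.25 for i in range(4)] for j in range(4)]
V = [[j * 0.25 for i in range(4)] for j in range(4)]
reduced = reduce_if_axis_aligned(U, V)

# A block whose U values also vary with the row is not axis aligned.
skewed = reduce_if_axis_aligned([[i + j for i in range(4)] for j in range(4)], V)
```

Here `reduced` holds four U values and four V values, while `skewed` is `None`, corresponding to the case in which the non-reduced coordinate set is used.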

The data processing unit may execute a compute shader program. The output data item value may be input from the data loading unit to the compute shader program.

As mentioned above, the processor may be a SIMD parallel processor, and each data item in a block may be associated with a processing channel of the SIMD parallel processor. In the above example, the axis alignment occurs when the coordinate axes of the data items for the block (U and V in the previous example) are aligned with the X-axis and Y-axis of the block.

Further, in these more general examples, the processor may be operable to retrieve a block of data items, wherein each of the data items is associated with coordinates for each of a plurality of dimensions of the stored data array. The data processing unit of the processor detects that the coordinates associated with the data items of the block are axis aligned. A data loading unit of the processor determines two or more integer coordinates for each coordinate in the set of coordinates and performs a uniqueness process on the determined integer coordinates to remove one or more duplicate integer coordinates and thereby determine a subset of the determined integer coordinates. The data loading unit generates an address of a data array element to be extracted from the stored data array using the determined subset of integer coordinates, and extracts the data array element from the stored data array using the generated address. For each of the data items of the block, the data loading unit uses the extracted subset of data array elements to determine a data item value and outputs the data item value.

FIG. 11 illustrates a computer system in which the graphics processing system described herein may be implemented. The computer system includes a CPU 1102, a GPU 1104, a memory 1106, a Neural Network Accelerator (NNA) 1108, and other devices 1114, such as a display 1116, speakers 1118, and a camera 1122. One or more processing blocks 1110 (corresponding to fragment processing unit 206 and texture processing unit 208) are implemented on GPU 1104. In other examples, one or more of the depicted components may be omitted from the system, and/or processing block 1110 may be implemented on CPU 1102 or within NNA 1108. The components of the computer system may communicate with each other via a communication bus 1120. Storage 1112 (corresponding to memory 204) is implemented as part of memory 1106.

The GPU and TPU of figs. 2 and 6 are shown as comprising a plurality of functional blocks. This is merely illustrative and is not intended to define a strict division between different logic elements of such entities. Each of the functional blocks may be provided in any suitable manner. It should be understood that intermediate values described herein as being formed by a graphics processing unit need not be physically generated by the graphics processing unit at any point in time, and may merely represent logical values that conveniently describe the processing performed by the graphics processing unit between its inputs and outputs.

The graphics processing units described herein may be embodied in hardware on an integrated circuit. The graphics processing unit described herein may be configured to perform any of the methods described herein. In general, any of the functions, methods, techniques or components described above may be implemented in software, firmware, hardware (e.g., fixed logic circuitry) or any combination thereof. The terms "module," "functionality," "component," "element," "unit," "block," and "logic" may be used herein to generally represent software, firmware, hardware, or any combination thereof. In the case of a software implementation, the module, functionality, component, element, unit, block or logic represents program code that performs specified tasks when executed on a processor. The algorithms and methods described herein may be executed by one or more processors executing code that causes the processors to perform the algorithms/methods. Examples of a computer-readable storage medium include Random Access Memory (RAM), read-only memory (ROM), optical disks, flash memory, hard disk memory, and other memory devices that can store instructions or other data using magnetic, optical, and other techniques and that can be accessed by a machine.

The terms computer program code and computer readable instructions as used herein refer to any kind of executable code for a processor, including code expressed in a machine language, an interpreted language, or a scripting language. Executable code includes binary code, machine code, byte code, code defining an integrated circuit (e.g., a hardware description language or netlist), and code expressed in a programming language such as C, Java or OpenCL. The executable code may be, for example, any kind of software, firmware, script, module, or library that, when suitably executed, processed, interpreted, compiled, or run in a virtual machine or other software environment, causes the processor of the computer system at which the executable code is supported to perform the tasks specified by the code.

The processor, computer, or computer system may be any kind of device, machine, or special purpose circuit, or a set or portion thereof, that has processing capabilities such that it can execute instructions. The processor may be or include any kind of general purpose or special purpose processor, such as a CPU, a GPU, an NNA, a system on a chip, a state machine, a media processor, an Application Specific Integrated Circuit (ASIC), a programmable logic array, a Field Programmable Gate Array (FPGA), or the like. The computer or computer system may include one or more processors.

The present invention is also intended to cover software defining the configuration of hardware as described herein, such as Hardware Description Language (HDL) software, for designing integrated circuits or for configuring programmable chips to perform desired functions. That is, a computer readable storage medium having encoded thereon computer readable program code in the form of an integrated circuit definition data set may be provided, which when processed (i.e., run) in an integrated circuit manufacturing system configures the system to manufacture a graphics processing unit configured to perform any of the methods described herein, or to manufacture a graphics processing unit comprising any of the apparatus described herein. The integrated circuit definition data set may be, for example, an integrated circuit description.

Accordingly, a method of manufacturing a graphics processing unit as described herein at an integrated circuit manufacturing system may be provided. Furthermore, an integrated circuit definition data set may be provided that, when processed in an integrated circuit manufacturing system, enables a method of manufacturing a graphics processing unit to be performed.

The integrated circuit definition data set may be in the form of computer code, for example, as a netlist, as code for configuring a programmable chip, or as a hardware description language defining hardware suitable for fabrication in an integrated circuit at any level, including as Register Transfer Level (RTL) code, as a high-level circuit representation (such as Verilog or VHDL), and as a low-level circuit representation (such as OASIS (RTM) and GDSII). A higher-level representation (e.g., RTL) that logically defines hardware suitable for fabrication in an integrated circuit may be processed at a computer system configured to generate a manufacturing definition of an integrated circuit in the context of a software environment that includes definitions of circuit elements and rules for combining those elements, in order to generate the manufacturing definition of the integrated circuit defined by the representation. As is typically the case when software is executed at a computer system to define a machine, one or more intermediate user steps (e.g., providing commands, variables, etc.) may be required before a computer system configured to generate a manufacturing definition of an integrated circuit can execute the code defining the integrated circuit and so generate that manufacturing definition.

An example of processing an integrated circuit definition data set at an integrated circuit manufacturing system to configure the system to manufacture a graphics processing unit will now be described with reference to fig. 12.

Fig. 12 illustrates an example of an Integrated Circuit (IC) fabrication system 1202 configured to fabricate a graphics processing unit as described in any of the examples herein. In particular, IC fabrication system 1202 includes layout processing system 1204 and integrated circuit generation system 1206. The IC fabrication system 1202 is configured to receive an IC definition data set (e.g., defining a graphics processing unit as described in any of the examples herein), process the IC definition data set, and generate an IC (e.g., embodying a graphics processing unit as described in any of the examples herein) from the IC definition data set. The IC fabrication system 1202 is configured to fabricate an integrated circuit embodying a graphics processing unit as described in any of the examples herein through processing of the IC definition data set.

Layout processing system 1204 is configured to receive and process the IC definition data set to determine a circuit layout. Methods of determining a circuit layout from an IC definition data set are known in the art and may involve, for example, synthesizing RTL code to determine a gate level representation of a circuit to be generated, for example in terms of logic components (e.g., NAND, NOR, AND, OR, MUX and FLIP-FLOP components). The circuit layout may be determined from the gate level representation of the circuit by determining location information for the logic components. This may be done automatically or with the participation of a user in order to optimize the circuit layout. When the layout processing system 1204 has determined a circuit layout, it may output a circuit layout definition to the IC generation system 1206. The circuit layout definition may be, for example, a circuit layout description.

As is known in the art, an IC generation system 1206 generates ICs from circuit layout definitions. For example, the IC generation system 1206 may implement a semiconductor device fabrication process that generates ICs, which may involve a multi-step sequence of photolithography and chemical processing steps during which electronic circuits are built up on wafers made of semiconductor material. The circuit layout definition may be in the form of a mask that may be used in a lithographic process for generating an IC from the circuit definition. Alternatively, the circuit layout definitions provided to the IC generation system 1206 may be in the form of computer readable code that the IC generation system 1206 may use to form a suitable mask for generating the IC.

The different processes performed by IC fabrication system 1202 may all be implemented in one location, e.g., by one party. Alternatively, IC fabrication system 1202 may be a distributed system, such that some processes may be performed at different locations and by different parties. For example, some of the following stages may be performed at different locations and/or by different parties: (i) synthesizing RTL code representing the IC definition data set to form a gate level representation of the circuit to be generated; (ii) generating a circuit layout based on the gate level representation; (iii) forming a mask according to the circuit layout; and (iv) using the mask to fabricate the integrated circuit.

In other examples, processing of the integrated circuit definition data set at the integrated circuit manufacturing system may configure the system to manufacture a graphics processing unit without the integrated circuit definition data set being processed to determine a circuit layout. For example, an integrated circuit definition data set may define a configuration of a reconfigurable processor, such as an FPGA, and processing of the data set may configure the IC manufacturing system to generate (e.g., by loading configuration data into the FPGA) the reconfigurable processor having the defined configuration.

In some embodiments, the integrated circuit manufacturing definition data set, when processed in the integrated circuit manufacturing system, may cause the integrated circuit manufacturing system to generate an apparatus as described herein. For example, an apparatus as described herein may be manufactured by configuring an integrated circuit manufacturing system in the manner described above with reference to fig. 12 through an integrated circuit manufacturing definition dataset.

In some examples, the integrated circuit definition dataset may contain software running on or in combination with hardware defined at the dataset. In the example shown in fig. 12, the IC generation system may be further configured by the integrated circuit definition data set to load firmware onto the integrated circuit in accordance with the program code defined at the integrated circuit definition data set at the time of manufacturing the integrated circuit or to otherwise provide the integrated circuit with the program code for use with the integrated circuit.

Specific implementations of the concepts set forth in the present disclosure in devices, apparatus, modules, and/or systems (and in methods implemented herein) may provide improved performance over known implementations. Performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and/or reduced power consumption. During the manufacture of such devices, apparatus, modules and systems (e.g., in integrated circuits), a tradeoff may be made between performance improvements and physical implementation, thereby improving the manufacturing method. For example, a tradeoff can be made between performance improvement and layout area, thereby matching the performance of a known implementation but using less silicon. This may be accomplished, for example, by reusing functional blocks in a serial fashion or by sharing functional blocks among elements of an apparatus, device, module, and/or system. Conversely, the concepts described herein that lead to improvements in the physical implementation of devices, apparatus, modules and systems (e.g., reduced silicon area) can be traded for improved performance. This may be accomplished, for example, by fabricating multiple instances of a module within a predefined area budget.

The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Appendix

Some numbered clauses are provided below, which form part of the present disclosure:

1. A method of retrieving a block of data items in a processor, each of the data items being associated with coordinates for each of a plurality of dimensions of a stored data array, the method comprising:

Data loading unit of processor:

Outputting the data item value.

2. The method of clause 1, further comprising performing a de-uniqueness process on the extracted data array elements to determine which of the extracted data array elements are included in the subset of each of the data items for the block.
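As an illustration of the step in clause 2, the following minimal sketch shows one way extracted elements might be mapped back to each data item. It assumes the bilinear case of the later clauses (a 2x2 footprint per data item); the dictionary `fetched` and the helper `subset_for_item` are hypothetical names introduced here, not part of the disclosure.

```python
import math

def subset_for_item(fetched, u, v):
    """Pick the 2x2 footprint of one data item out of the shared fetched block.

    `fetched` maps (integer_u, integer_v) -> element value; in this sketch it
    stands in for the elements extracted once for the whole block.
    """
    u0, v0 = math.floor(u), math.floor(v)
    return [fetched[(uu, vv)] for vv in (v0, v0 + 1) for uu in (u0, u0 + 1)]

# A 5x5 block of elements fetched once for a 4x4 block of data items.
# Each value encodes its own position so the selection is easy to inspect.
fetched = {(u, v): 100 * v + u for u in range(10, 15) for v in range(7, 12)}

subset = subset_for_item(fetched, 11.25, 8.5)
print(subset)  # [811, 812, 911, 912]
```

Neighbouring data items select overlapping entries of the same `fetched` block, which is why each element only needs to be extracted once.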

3. The method of clause 1 or 2, wherein the determining the data item value comprises performing bilinear interpolation on the subset of extracted data array elements, wherein the determining two or more integer coordinates for each coordinate in a set of coordinates comprises determining exactly two integer coordinates for each coordinate in the set of coordinates, and

Wherein for each of the data items of the block, the subset includes four of the extracted data array elements, and the data item value for the data item is determined by determining the result of bilinear interpolation of those four extracted data array elements.

4. The method of clause 3, wherein for each of the data items of the block, four pairs of integer coordinates correspond to four addresses of the four extracted data array elements of the subset.

5. The method of clause 3 or 4, wherein the two integer coordinates determined for each of the coordinates are: (i) a first integer coordinate corresponding to the coordinate rounded down to an integer data array element position, and (ii) a second integer coordinate, the second integer coordinate being one greater than the first integer coordinate.
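The pair of integer coordinates described in clause 5 can be sketched as follows. This is a minimal illustration only; it assumes the coordinate is already expressed in data array element units, and it ignores any half-element offset, wrapping or clamping that a real data loading unit might apply. The function name is hypothetical.

```python
import math

def integer_coordinate_pair(coord):
    """Return the rounded-down integer coordinate and the next one up."""
    lower = math.floor(coord)
    return lower, lower + 1

print(integer_coordinate_pair(2.25))   # (2, 3)
print(integer_coordinate_pair(-0.5))   # (-1, 0)
```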

6. The method of any of clauses 3 to 5, wherein the block of data items is an m×n block of data items, wherein the stored data array is a 2D array such that each data item is associated with coordinates for a first dimension and coordinates for a second dimension orthogonal to the first dimension, and wherein the subset of determined integer coordinates comprises n+1 integer coordinates for the first dimension and m+1 integer coordinates for the second dimension.

7. The method of clause 6, wherein the extracted data array element represents a (m+1) x (n+1) data array element block of the stored data array.

8. The method of clause 7, wherein n = m = 4,

Wherein the data loading unit comprises 32 address generators, wherein each address generator can generate an address of a data array element to be extracted within each clock cycle, and wherein within one clock cycle the addresses of the data array elements in a 5x5 block of data array elements are generated using 25 of the address generators.

9. The method of clause 7, wherein m=4 and n=2, or wherein m=2 and n=4,

Wherein the data loading unit comprises 16 address generators, wherein each address generator can generate an address of a data array element to be extracted within each clock cycle, and wherein within one clock cycle the addresses of the data array elements in a 5x3 or 3x5 data array element block are generated using 15 of the address generators.
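The address generator counts in clauses 8 and 9 follow from the (m+1) x (n+1) block size. The short check below works through that arithmetic; the helper names are hypothetical and the model simply assumes one address per generator per clock cycle, as the clauses state.

```python
def addresses_needed(m, n):
    """Number of unique element addresses for an m x n block of data items."""
    return (m + 1) * (n + 1)

def fits_in_one_cycle(m, n, generators):
    """True if every address can be generated in a single clock cycle."""
    return addresses_needed(m, n) <= generators

print(addresses_needed(4, 4), fits_in_one_cycle(4, 4, 32))  # 25 True
print(addresses_needed(4, 2), fits_in_one_cycle(4, 2, 16))  # 15 True
```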

10. The method of any of clauses 3-9, further comprising determining a fractional portion of a data array element position corresponding to each of the coordinates of the set, wherein a first interpolation weight for the bilinear interpolation of a data item is based on the determined fractional portion of a data array element position corresponding to the coordinate associated with the data item for a first dimension, and wherein a second interpolation weight for the bilinear interpolation of the data item is based on the determined fractional portion of a data array element position corresponding to the coordinate associated with the data item for a second dimension, the second dimension orthogonal to the first dimension.
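The role of the fractional portions as interpolation weights in clause 10 can be sketched as below. This is a generic bilinear interpolation written for illustration, assuming coordinates in element units; `fetch` is a hypothetical callback standing in for access to the extracted data array elements.

```python
import math

def bilinear(fetch, u, v):
    """Bilinearly interpolate the 2x2 elements surrounding (u, v)."""
    u0, v0 = math.floor(u), math.floor(v)
    wu = u - u0  # fractional portion in the first dimension
    wv = v - v0  # fractional portion in the second dimension
    top = (1 - wu) * fetch(u0, v0) + wu * fetch(u0 + 1, v0)
    bottom = (1 - wu) * fetch(u0, v0 + 1) + wu * fetch(u0 + 1, v0 + 1)
    return (1 - wv) * top + wv * bottom

# A data array whose values vary linearly is reproduced exactly.
linear_array = lambda x, y: x + 10 * y
result = bilinear(linear_array, 2.25, 3.5)
print(result)  # 37.25
```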

11. The method of clause 10, further comprising: before generating the addresses, detecting that the determined fractional portion of the data array element position corresponding to a coordinate is zero, and in response to this detection, determining that two of the four data array elements of the bilinear interpolation for the data item associated with that coordinate are not needed to determine the result of the bilinear interpolation.
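A zero fractional portion means the corresponding interpolation weight is zero, so the second integer coordinate contributes nothing. The sketch below illustrates that reasoning in software; it is an assumption-laden model with hypothetical names, not the detection logic of the data loading unit.

```python
import math

def needed_integer_coords(coord):
    """Integer coordinates that actually contribute to the interpolation."""
    lower = math.floor(coord)
    if coord == lower:          # fractional portion is zero
        return [lower]          # weight of lower + 1 is zero, so skip it
    return [lower, lower + 1]

u_needed = needed_integer_coords(3.0)   # [3]
v_needed = needed_integer_coords(4.5)   # [4, 5]
elements = [(u, v) for v in v_needed for u in u_needed]
print(elements)  # [(3, 4), (3, 5)]
```

Here only two of the four elements of the 2x2 footprint are needed for the data item at (3.0, 4.5).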

12. The method of any preceding clause, wherein the uniqueness process is performed on the determined integer coordinates in response to determining that there is a sufficient number of repeated determined integer coordinates.

13. The method of clause 12 when dependent on clause 6 or 7, wherein m = n = 4, and wherein the determining that there are a sufficient number of repeated determined integer coordinates includes determining whether all six of the following expressions are satisfied:

$(u_{0+} = u_{1-}) \lor (u_{0+} = u_{2-}) \lor (u_{0+} = u_{3-}) \lor (u_{0+} = u_{3+})$

$(u_{1+} = u_{0-}) \lor (u_{1+} = u_{2-}) \lor (u_{1+} = u_{3-}) \lor (u_{1+} = u_{3+})$

$(u_{2+} = u_{0-}) \lor (u_{2+} = u_{1-}) \lor (u_{2+} = u_{3-}) \lor (u_{2+} = u_{3+})$

$(v_{0+} = v_{1-}) \lor (v_{0+} = v_{2-}) \lor (v_{0+} = v_{3-}) \lor (v_{0+} = v_{3+})$

$(v_{1+} = v_{0-}) \lor (v_{1+} = v_{2-}) \lor (v_{1+} = v_{3-}) \lor (v_{1+} = v_{3+})$

$(v_{2+} = v_{0-}) \lor (v_{2+} = v_{1-}) \lor (v_{2+} = v_{3-}) \lor (v_{2+} = v_{3+})$

where $u_{i-}$ and $u_{i+}$ are the two integer coordinates in the first dimension for each of the data items in the i-th column of the block of data items, where $i \in \{0, 1, 2, 3\}$;

wherein $v_{j-}$ and $v_{j+}$ are the two integer coordinates in the second dimension for each of the data items in the j-th row of the block of data items, where $j \in \{0, 1, 2, 3\}$; and

wherein $\lor$ represents a logical OR operation.
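The six expressions of clause 13 can be evaluated in software as sketched below. The code transcribes the expressions as they are written in the clause; `lo[i]` and `hi[i]` are hypothetical names for the lower and upper integer coordinates of column (or row) i, and the example coordinate values are assumptions chosen for illustration.

```python
def dimension_has_enough_duplicates(lo, hi):
    """Evaluate the three expressions for one dimension of a 4x4 block.

    For i in 0..2, the upper coordinate hi[i] must equal the lower
    coordinate of some other column/row, or the upper coordinate hi[3].
    """
    for i in range(3):
        candidates = [lo[k] for k in range(4) if k != i] + [hi[3]]
        if hi[i] not in candidates:
            return False
    return True

def enough_duplicates(u_lo, u_hi, v_lo, v_hi):
    """All six expressions (three per dimension) must be satisfied."""
    return (dimension_has_enough_duplicates(u_lo, u_hi)
            and dimension_has_enough_duplicates(v_lo, v_hi))

# Coordinates one element apart: neighbours share integer coordinates.
dense = enough_duplicates([10, 11, 12, 13], [11, 12, 13, 14],
                          [7, 8, 9, 10], [8, 9, 10, 11])
# Coordinates two elements apart in the first dimension: nothing is shared.
sparse = enough_duplicates([10, 12, 14, 16], [11, 13, 15, 17],
                           [7, 8, 9, 10], [8, 9, 10, 11])
print(dense, sparse)  # True False
```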

14. The method of clause 13, wherein the integer coordinates of the 5x5 data array element block are:

15. The method of clause 1 or 2, wherein the determining the data item value comprises performing two-dimensional polynomial filtering on the subset of extracted data array elements using a polynomial having a degree d, wherein d > 1, wherein the determining two or more integer coordinates for each coordinate in a set of coordinates comprises determining (d+1) integer coordinates for each coordinate in the set of coordinates, and

Wherein for each of the data items of the block, the subset comprises (d+1)² of the extracted data array elements, and the data item value for the data item is determined by determining the result of two-dimensional polynomial interpolation of those (d+1)² extracted data array elements, the polynomial interpolation using the polynomial with degree d.
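The footprint sizes in clause 15 can be illustrated with the sketch below. The clause does not say which (d+1) integer coordinates are chosen, so the centring used here (for example floor-1 to floor+2 when d = 3) is purely an assumption, as are the function names and example coordinates.

```python
import math

def footprint_coords(coord, d):
    """(d+1) consecutive integer coordinates around coord (assumed centring)."""
    start = math.floor(coord) - d // 2
    return [start + k for k in range(d + 1)]

def unique_footprint_coords(coords, d):
    """Union of the footprints of several coordinates, duplicates removed."""
    return sorted({c for coord in coords for c in footprint_coords(coord, d)})

d = 3  # e.g. a cubic polynomial
u_coords = [10.25, 11.25, 12.25, 13.25]

per_item_elements = (d + 1) ** 2                 # 16 elements per data item
unique_u = unique_footprint_coords(u_coords, d)  # integer coordinates 9..15
print(per_item_elements, len(unique_u))  # 16 7
```

With the same spacing in both dimensions, a 4x4 block would then need 7 x 7 = 49 unique elements rather than 16 x 16 = 256 independently addressed ones.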

16. The method of any preceding clause, wherein one or both of the dimensions of the stored data array are flipped relative to the dimensions of the data item block.

17. The method of any preceding clause, wherein a pair of integer coordinates of a first data item for the block is the same as a pair of integer coordinates of a second data item for the block, and wherein the address corresponding to the pair of integer coordinates is generated once for processing the block of data items due to the uniqueness process.

18. A method as in any preceding clause, wherein the uniqueness process makes all of the addresses generated for processing the block of data items unique.

19. The method of any preceding clause, wherein the set of coordinates is a reduced set of coordinates comprising only one coordinate for a first dimension for each line of data items aligned in the first dimension within the block of data items, and only one coordinate for a second dimension for each line of data items aligned in the second dimension within the block of data items, the second dimension orthogonal to the first dimension.

20. The method of any preceding clause, wherein the detecting that the coordinates associated with the data item of the block are axis aligned comprises: determining that a coordinate axis associated with the data item of the block is aligned with an axis of the data item block.

21. A processor configured to retrieve a block of data items, each of the data items being associated with coordinates for each of a plurality of dimensions of a stored data array, the processor comprising a data processing unit and a data loading unit,

Wherein the data processing unit is configured to:

Transmitting the coordinate set to a data loading unit; and

Wherein the data loading unit is configured to:

Outputting the data item value.

22. The processor of clause 21, wherein the data loading unit comprises:

An address generation module configured to: (i) determining the two or more integer coordinates for each of the coordinates, (ii) performing the uniqueness process on the determined integer coordinates, and (iii) generating the address of the data array element to be extracted using the subset of the determined integer coordinates;

an address processing module configured to extract the data array element using the generated address; and

A filtering module configured to: (i) Determining the data item value for each of the data items of the block by applying filtering to the extracted subset of data array elements, and (ii) outputting the data item value.

23. A processor configured to perform the method of any one of clauses 1 to 20.

24. Computer readable code configured such that when the code is run, the method of any of clauses 1 to 20 is performed.

25. An integrated circuit definition data set that, when processed in an integrated circuit manufacturing system, configures the integrated circuit manufacturing system to manufacture the processor of any one of clauses 21 to 23.

Claims

1. A method of applying texture filtering to a block of fragments in a Graphics Processing Unit (GPU), each of the fragments being associated with texture coordinates for each of a plurality of dimensions of a texture, the method comprising:

Detecting that the texture coordinates for the fragments of the block are axis aligned;

Generating a texel address of the texel to be extracted using the subset of the determined integer texel coordinates;

Extracting texels using the generated texel address;

Outputting the filtered values.

2. The method of claim 1, further comprising performing a de-uniqueness process on the extracted texels to determine which of the extracted texels are included in the subset for each of the fragments of the block.

3. The method of claim 1 or 2, wherein the texture filtering is bilinear filtering, wherein the determining two or more integer texel coordinates for each texture coordinate in a set of texture coordinates comprises determining exactly two integer texel coordinates for each texture coordinate in the set of texture coordinates, and

Wherein for each of the fragments of the block, the subset includes four of the extracted texels, and the filtered value for the fragment is determined by determining the result of bilinear interpolation of those four extracted texels.

4. A method as claimed in claim 3, wherein the two integer texel coordinates determined for each of the texture coordinates are: (i) a first integer texel coordinate corresponding to the texture coordinate rounded down to an integer texel position; and (ii) a second integer texel coordinate, the second integer texel coordinate being one greater than the first integer texel coordinate.

5. The method of claim 3 or 4, wherein a level of detail of the texture filtering corresponds to a 1:1 mapping between a pitch of the fragments in the block and a pitch of the texels in the texture.

6. The method of any of claims 3 to 5, wherein the block of fragments is an m×n block of fragments, wherein the texture is a 2D texture such that each fragment is associated with a texture coordinate for a horizontal dimension and a texture coordinate for a vertical dimension, and wherein the subset of the determined integer texel coordinates includes n+1 integer texel coordinates for the horizontal dimension and m+1 integer texel coordinates for the vertical dimension.
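The effect of the 1:1 pitch in claim 5 can be seen by counting unique integer texel coordinates along one row of four fragments. The sketch below is illustrative only; the starting coordinate and the function name are assumptions, and coordinates are taken to be in texel units.

```python
import math

def count_unique_texel_coords(first, pitch, count=4):
    """Unique floor / floor+1 texel coordinates for `count` fragments."""
    unique = set()
    for k in range(count):
        lower = math.floor(first + k * pitch)
        unique.update((lower, lower + 1))
    return len(unique)

one_to_one = count_unique_texel_coords(10.25, 1.0)  # neighbours share texels
two_to_one = count_unique_texel_coords(10.25, 2.0)  # no sharing at all
print(one_to_one, two_to_one)  # 5 8
```

At a 1:1 pitch the four fragments need only five texel coordinates in that dimension, whereas at a pitch of two texels per fragment all eight are distinct and removing duplicates saves nothing.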

7. The method of claim 6, wherein the extracted texels represent a (m+1) x (n+1) texel block of the texture.

8. The method of any of claims 3 to 7, further comprising determining a fractional portion of the texel position corresponding to each of the texture coordinates of the set, wherein a horizontal interpolation weight for the bilinear interpolation of a fragment is based on the determined fractional portion of the texel position corresponding to the texture coordinate associated with the fragment for the horizontal dimension, and wherein a vertical interpolation weight for the bilinear interpolation of the fragment is based on the determined fractional portion of the texel position corresponding to the texture coordinate associated with the fragment for the vertical dimension.

9. The method of claim 8, further comprising: before generating the texel addresses, detecting that the determined fractional portion of the texel position corresponding to a texture coordinate is zero, and in response to this detection, determining that two of the four texels of the bilinear interpolation for the fragment associated with that texture coordinate are not needed to determine the result of the bilinear interpolation.

10. The method of any preceding claim, wherein the uniqueness process is performed on the determined integer texel coordinates in response to determining that there is a sufficient number of repeated determined integer texel coordinates.

11. The method of claim 10 when dependent on claim 6 or 7, wherein m = n = 4, and wherein said determining that there is a sufficient number of repeated determined integer texel coordinates comprises determining whether all six of the following expressions are satisfied:

$(u_{0+} = u_{1-}) \lor (u_{0+} = u_{2-}) \lor (u_{0+} = u_{3-}) \lor (u_{0+} = u_{3+})$

$(u_{1+} = u_{0-}) \lor (u_{1+} = u_{2-}) \lor (u_{1+} = u_{3-}) \lor (u_{1+} = u_{3+})$

$(u_{2+} = u_{0-}) \lor (u_{2+} = u_{1-}) \lor (u_{2+} = u_{3-}) \lor (u_{2+} = u_{3+})$

$(v_{0+} = v_{1-}) \lor (v_{0+} = v_{2-}) \lor (v_{0+} = v_{3-}) \lor (v_{0+} = v_{3+})$

$(v_{1+} = v_{0-}) \lor (v_{1+} = v_{2-}) \lor (v_{1+} = v_{3-}) \lor (v_{1+} = v_{3+})$

$(v_{2+} = v_{0-}) \lor (v_{2+} = v_{1-}) \lor (v_{2+} = v_{3-}) \lor (v_{2+} = v_{3+})$

where $u_{i-}$ and $u_{i+}$ are the two integer texel coordinates in the horizontal dimension for each of the fragments in the i-th column of the block of fragments, where $i \in \{0, 1, 2, 3\}$;

wherein $v_{j-}$ and $v_{j+}$ are the two integer texel coordinates in the vertical dimension for each of the fragments in the j-th row of the block of fragments, where $j \in \{0, 1, 2, 3\}$; and

wherein $\lor$ represents a logical OR operation.

12. The method of claim 11, wherein the integer texel coordinates of a 5x5 texel block are:

13. The method of claim 1 or 2, wherein the texture filtering is a two-dimensional polynomial filtering using a polynomial having a degree d, wherein d >1, wherein the determining two or more integer texel coordinates for each texture coordinate in a set of texture coordinates comprises determining (d+1) integer texel coordinates for each texture coordinate in the set of texture coordinates, and

Wherein for each of the fragments of the block, the subset includes (d+1)² of the extracted texels, and the filtered value for the fragment is determined by determining the result of two-dimensional polynomial interpolation of those (d+1)² extracted texels, the polynomial interpolation using the polynomial with the degree d.

14. The method of any preceding claim, wherein one or both of the horizontal and vertical dimensions of the texture are flipped relative to the dimensions of the block of fragments.

15. The method of any preceding claim, wherein a pair of integer texel coordinates for a first fragment of the block is the same as a pair of integer texel coordinates for a second fragment of the block, and wherein the texel address corresponding to the pair of integer texel coordinates is generated once for processing the block of fragments as a result of the uniqueness process.

16. A method as claimed in any preceding claim, wherein the uniqueness process makes all of the texel addresses generated for processing the fragment block unique.

17. The method of any preceding claim, wherein the set of texture coordinates is a reduced set of texture coordinates comprising:

18. A graphics processing unit configured to apply texture filtering to a block of fragments, each of the fragments being associated with texture coordinates for each of a plurality of dimensions of a texture, the graphics processing unit comprising a fragment processing unit and a texture processing unit,

Wherein the fragment processing unit is configured to:

detecting that the texture coordinates for the fragments of the block are axis aligned; and

Sending a texture coordinate set to the texture processing unit; and

Wherein the texture processing unit is configured to:

determining two or more integer texel coordinates for each texture coordinate in the set of texture coordinates;

Extracting texels using the generated texel address;

Outputting the filtered values.

19. The graphics processing unit of claim 18, wherein the texture processing unit comprises:

A texture address generation module configured to: (i) determining the two or more integer texel coordinates for each of the texture coordinates, (ii) performing the uniqueness process on the determined integer texel coordinates, and (iii) generating the texel address of the texel to be extracted using the subset of the determined integer texel coordinates;

an address processing module configured to extract the texels using the generated texel addresses; and

A texture filtering module configured to: (i) determine the filtered value for each of the fragments of the block by applying filtering to a subset of the extracted texels, and (ii) output the filtered value.

20. Computer readable code configured such that when the code is run, the method of any of claims 1 to 17 is performed.

21. An integrated circuit definition data set which, when processed in an integrated circuit manufacturing system, configures the integrated circuit manufacturing system to manufacture a graphics processing unit as claimed in claim 18 or 19.