CN119068099A - Ray stack manager, graphics processing method and system - Google Patents
- Publication number
- CN119068099A (application CN202310649821.7A)
- Authority
- CN
- China
- Prior art keywords
- stack
- ray
- module
- request
- manager
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/06—Ray-tracing
Abstract
The invention discloses a ray stack manager, a graphics processing method, and a graphics processing system. The ray stack manager comprises a plurality of management units, each comprising a stack storage module and an access logic module: the stack storage module stores the stack of a corresponding ray, and the access logic module receives stack requests and manages the stack in the corresponding stack storage module according to each request. Each ray corresponds to one stack, and each stack stores the traversal state of its corresponding ray. The invention can process a plurality of rays in parallel and manage a plurality of stacks effectively, improving the efficiency of ray traversal and optimizing ray tracing performance.
Description
Technical Field
The invention belongs to the technical field of image processing, and in particular relates to a ray stack manager, a graphics processing method, and a graphics processing system.
Background
In the field of computer graphics, a graphics processing unit (Graphics Processing Unit, GPU) is a special-purpose chip that performs functions related to rendering images to a screen. User-created applications may communicate with the GPU through a graphics application programming interface (Application Programming Interface, API). Common graphics APIs include OpenGL, Vulkan, and DirectX.
Currently, some APIs have been extended to support a rendering technique called ray tracing. Ray tracing itself is not a new technique, but consumer GPUs with dedicated ray tracing hardware have only appeared in the last 10 years. Ray tracing is a graphics technique in which primitives (usually triangles) describing a scene are intersected with one or more lines or line segments (called rays) to determine which primitive each ray intersects, usually also finding the intersected primitive closest to a given point on the ray (called the ray origin). These ray-primitive intersection queries are very useful in graphics because rays can model the propagation of light in a scene, and the results can be used to determine how lighting effects should be handled during scene rendering.
To avoid testing each ray against each triangle in the scene, primitives are typically grouped together hierarchically using an acceleration structure, commonly referred to as a "tree". The ray is first tested for intersection against the root node of the acceleration structure, which represents a volume covering the entire scene. If and only if the ray intersects that volume does traversal move down the tree, testing the root's direct child nodes. Each of these child nodes represents a volume covering a smaller portion of the scene, so as the ray traverses down the tree it intersects smaller and smaller volumes. Eventually it reaches the leaf level of the tree and tests a small number of primitives directly. Traversing the entire tree in this manner reveals which primitives the ray intersects without testing each primitive individually.
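As a hedged illustration of this hierarchical culling (not the patent's hardware design), the following Python sketch tests a ray against a node's bounding box with a standard slab test and descends into children only on a hit; the dictionary layout of a node and the helper names are assumptions for illustration.

```python
def ray_hits_aabb(origin, direction, box_min, box_max):
    """Standard slab test: True if the ray (t >= 0) hits the axis-aligned box."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            if not (lo <= o <= hi):   # parallel to this slab and outside it
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def collect_candidate_leaves(node, origin, direction, out):
    """Depth-first descent: skip whole subtrees whose box the ray misses."""
    if not ray_hits_aabb(origin, direction, node["min"], node["max"]):
        return
    if not node["children"]:          # leaf node: a small set of primitives
        out.extend(node["prims"])
        return
    for child in node["children"]:
        collect_candidate_leaves(child, origin, direction, out)
```

Only primitives in leaves whose enclosing boxes are hit ever reach an exact ray-primitive test, which is the saving the hierarchy provides.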
With the increasing complexity of graphics processing, it is necessary to further improve the performance of ray tracing.
Disclosure of Invention
To address the need for improvement over the prior art, the present invention provides a ray stack manager, a graphics processing method, and a graphics processing system that have good stack management capability and can increase the parallelism of ray traversal, thereby effectively improving ray tracing performance.
To achieve the above object, according to one aspect of the present invention, there is provided a ray stack manager comprising a plurality of management units. Each management unit comprises a stack storage module for storing the stack of a corresponding ray and an access logic module for receiving stack requests and managing the stack in the corresponding stack storage module according to each request, wherein each ray corresponds to one stack and each stack stores the traversal state of its corresponding ray.
In some embodiments, each of the stacks includes at least one stack block, each of the stack blocks including a plurality of entries, wherein the stack blocks in different stacks have the same size.
In some embodiments, each access logic unit comprises a front-end processing module for receiving the stack request and classifying the stack request according to a stack state in the stack storage module, and a plurality of processing modules for responding to the stack request according to the classification of the front-end processing module.
In some embodiments, the plurality of processing modules includes a first processing module for controlling popping of an entry at the top of the stack in response to a pop request when the stack block has more than one entry.
In some embodiments, the plurality of processing modules includes a second processing module; when the stack block has only one entry, the second processing module is configured to respond to a pop request by controlling the popping of that last entry at the top of the stack block and controlling a prefetch operation on the stack block below it.
In some implementations, the plurality of processing modules includes a third processing module for controlling writing back a current stack block to memory and updating the current stack block with new data when the stack request is a push request and the current stack block overflows.
In some implementations, the new data is data in a stack block next to the current stack block.
In some implementations, the plurality of processing modules includes a fourth processing module for controlling updating a current stack block when the stack request is a push request and the current stack block is not overflowed.
In some embodiments, each of the access logic units further includes a stack size storage module configured to store the size of a stack, and the front-end processing module determines, from the stack size information stored in the stack size storage module, whether the stack block will overflow or underflow.
In some implementations, each of the access logic units further includes a line state storage module for storing the state of a stack block, in order to determine whether the stack block is waiting for a linefill.
In some implementations, the line state storage module is further configured to track a dirty bit for each stack block, and the data in a stack block is written back to memory when its dirty bit is set, wherein the dirty bit indicates whether the block's data has been modified.
In some embodiments, each access logic unit further comprises a delay request module for storing a stack request while the stack block corresponding to that request waits for a linefill, and for suspending the input interface of the ray stack manager when the delay request module is full.
In some embodiments, the plurality of processing modules are further configured to update, by an arbiter, stack blocks in the stack storage module, and to generate a linefill request and/or a write-back request to the memory.
In some embodiments, the stack storage module is a dual-port SRAM, and each stack entry includes a pointer to at least one node that the corresponding ray needs to traverse, the node being determined from the ray tracing acceleration structure.
In some embodiments, the ray stack manager further comprises a memory interface multiplexed by a plurality of the management units for data interaction with memory, wherein the data interaction comprises linefill and write-back.
According to another aspect of the present invention, there is provided a ray tracing kernel including the ray stack manager described above. The ray tracing kernel further includes a ray scheduling module for generating ray indices, a ray intersection module for selecting, from a plurality of rays, the rays that enter an intersection test, and a ray traversal module for performing ray-bounding-box and/or ray-polygon intersection tests. The ray traversal module sends stack requests to the ray stack manager to determine the next node to be traversed by each ray that passes the intersection test, and the ray stack manager receives the stack requests and manages the stack state of each ray.
In some embodiments, the ray tracing kernel further comprises a ray data storage module for storing information about a plurality of rays during ray traversal, wherein the information comprises at least one of the ray origin, the ray direction, the ray minimum distance, the ray maximum distance, and the nearest primitive hit by the ray, or a combination thereof.
In some embodiments, the ray tracing kernel further includes a node data acquisition module for acquiring node information of an acceleration structure from a hierarchical memory, and for sending the ray index and the node information received from the ray scheduling module to the ray intersection module.
In some embodiments, the ray intersection module includes a conversion unit for converting rays from a world space coordinate system to a model space coordinate system.
In some embodiments, the ray traversal module is further configured to send a pointer back to the ray scheduling module after determining a next node to be traversed by each ray that passes the intersection test.
In some embodiments, the ray tracing kernel processes multiple rays in parallel per cycle.
According to still another aspect of the present invention, there is provided a graphics processing system including the ray tracing kernel described above and a plurality of shader kernels. The shader kernels invoke ray tracing functions and send the initialization information required for ray traversal to the ray tracing kernel; the ray tracing kernel performs ray traversal according to the initialization information and returns the results; and the shader kernels perform lighting calculations in rendering according to the returned results.
According to still another aspect of the present invention, there is provided a graphic processing method including:
maintaining a stack for each ray to be traversed, using the ray stack manager in the ray tracing kernel described above, to store traversal state information for each ray;
determining the next node to be traversed by each ray according to the stack requests and the traversal state information, until the traversal of all rays is complete and the traversal results are determined; and
returning the traversal results to the shader kernel for subsequent rendering.
According to yet another aspect of the present invention, there is provided an electronic device comprising the above-described ray stack manager, the above-described ray tracing kernel, or the above-described graphics processing system; or the electronic device comprises a processor and a memory communicatively coupled to the processor, the memory storing instructions executable by the processor that, when executed, enable the processor to perform the above method.
According to yet another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions which, when executed by a processor, perform the above method.
In general, compared with the prior art, the present technical scheme manages one stack for each ray through a plurality of parallel processing units, divides each stack into a plurality of stack blocks, and keeps the top stack block of each ray resident in the stack storage module of its processing unit. Multiple rays can thus remain in flight in parallel, so that multiple stacks are managed effectively, the efficiency of ray traversal processing is improved, and ray tracing performance is optimized.
Drawings
FIG. 1 is a schematic diagram of a set of triangles in a two-dimensional coordinate system in accordance with an embodiment of the present invention;
FIG. 2 is a topological view of the acceleration structure of FIG. 1 in accordance with an embodiment of the invention;
FIG. 3 is a schematic diagram of a ray stack manager in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a stack block according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a ray tracing kernel according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a graphics processing system in accordance with an embodiment of the present invention;
FIG. 7 is a flow chart of a graphics processing method according to an embodiment of the present invention;
FIG. 8 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. As will be recognized by those of skill in the pertinent art, the described embodiments may be modified in various different ways without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
Vulkan and DirectX have both been extended to support ray tracing, and both use similar frameworks. Ray tracing functionality divides roughly into two parts: building the acceleration structure and traversing the acceleration structure. The architecture described below involves only the latter, and assumes that the acceleration structure can be created on a general-purpose processor or on another special-purpose processor that is not disclosed here.
In the prior art, acceleration structures are used to reduce the number of ray-primitive intersection tests needed to determine which primitives a ray intersects. A scene may be defined by a set of primitives, which for ease of description are assumed to be triangles; it will be understood that the invention is not limited thereto. A triangle is given by three points in a three-dimensional coordinate system. A ray is a line segment in a three-dimensional coordinate system, given by a point (the ray origin O), a vector (the ray direction D), and two scalars (the minimum and maximum distances). A point along such a ray is given by P = O + D·t, where t must lie between the minimum and maximum distances.
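The parameterization P = O + D·t can be written out directly. The small helper below is illustrative only; it evaluates the equation componentwise and enforces the minimum/maximum distance bounds just described.

```python
def point_along_ray(origin, direction, t, t_min, t_max):
    """Return P = O + D*t, valid only when t_min <= t <= t_max."""
    if not (t_min <= t <= t_max):
        raise ValueError("t lies outside the ray's valid distance range")
    # evaluate O + D*t componentwise
    return tuple(o + d * t for o, d in zip(origin, direction))
```

For example, a ray from origin (1, 2, 3) along direction (0, 0, 1) at t = 2 passes through (1, 2, 5).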
FIG. 1 is a schematic diagram of a set of triangles in a two-dimensional coordinate system. As shown in FIG. 1, the triangles are grouped hierarchically and tightly enclosed by axis-aligned bounding boxes (Axis Aligned Bounding Boxes, AABBs). In FIG. 1, 11 triangles are divided into 4 groups defined by 4 primitive bounding boxes: primitive bounding box 1 includes triangles A, B, C, and D; primitive bounding box 2 includes triangles E and F; primitive bounding box 3 includes triangles G and H; and primitive bounding box 4 includes triangles I, J, and K. FIG. 2 shows a topological view of the same acceleration structure as FIG. 1. In an embodiment of the invention, the axis-aligned bounding boxes are represented by squares (also called nodes) in a tree structure, with an arrow from one node to another representing a parent-child relationship. A node may have a number of child nodes (in this example up to 4), and the bounding boxes of all child nodes should be contained in the bounding box of the parent node. Specifically, node 1 corresponds to primitive bounding box 1 shown in FIG. 1, and its child nodes 5 to 8 correspond to triangles A, B, C, and D, respectively; node 2 corresponds to primitive bounding box 2, and its child nodes 9 and 10 correspond to triangles E and F; node 3 corresponds to primitive bounding box 3, and its child nodes 11 and 12 correspond to triangles G and H; and node 4 corresponds to primitive bounding box 4, and its child nodes 13 to 15 correspond to triangles I, J, and K.
A ray tracing architecture may use a stack to traverse the acceleration structure of FIGS. 1 and 2 as a depth-first search. First the ray is tested against root node 0. If the ray intersects root node 0, references to its child nodes 1, 2, 3, and 4 are pushed onto the stack. Note that child nodes 1, 2, 3, and 4 may be stored contiguously in memory, so in practice only one reference to the beginning of the group is needed. Next, the top entry in the stack is popped to obtain the bounding box for the next intersection test. Child nodes 1, 2, 3, and 4 are fetched and intersection-tested. A bounding box that intersects the ray has its child references pushed onto the stack, while a bounding box that does not intersect the ray leaves the stack unchanged. If a node referencing a triangle is determined to intersect, the reference added to the stack is a reference to that triangle, and when that item is popped from the stack a ray-triangle intersection test is performed. If the ray does intersect a triangle, the hit can be recorded. Typically, the nearest triangle intersection (i.e., the smallest t in the ray equation) is required, so a record of the nearest triangle can be kept and updated as triangles are found to intersect. This process continues until the stack is empty and the tree structure has been completely traversed.
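The depth-first procedure above can be sketched in Python over the FIG. 2 topology. The `hits_box` and `hit_triangle` callbacks are stand-ins for the intersection tests (assumptions for illustration), not the patent's hardware pipeline; `hit_triangle` returns the hit distance t or None on a miss.

```python
# Topology of FIG. 2: node 0 is the root; leaves 5..15 reference triangles A..K.
CHILDREN = {0: [1, 2, 3, 4], 1: [5, 6, 7, 8], 2: [9, 10],
            3: [11, 12], 4: [13, 14, 15]}
TRIANGLE = {5: "A", 6: "B", 7: "C", 8: "D", 9: "E", 10: "F",
            11: "G", 12: "H", 13: "I", 14: "J", 15: "K"}

def traverse(hits_box, hit_triangle):
    """Return (nearest_t, triangle) via an explicit stack, or (inf, None)."""
    nearest = (float("inf"), None)
    if not hits_box(0):
        return nearest
    stack = list(reversed(CHILDREN[0]))    # push the hit root's children
    while stack:
        node = stack.pop()                 # pop the top entry and test it
        if node in TRIANGLE:
            t = hit_triangle(TRIANGLE[node])
            if t is not None and t < nearest[0]:
                nearest = (t, TRIANGLE[node])   # keep the nearest hit so far
        elif hits_box(node):
            stack.extend(reversed(CHILDREN[node]))
        # a missed interior node changes nothing: its subtree is skipped
    return nearest
```

The loop ends exactly when the stack empties, i.e. when the tree has been fully traversed.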
It will be appreciated that the order in which the child references of hit nodes are pushed onto the stack may affect overall performance. If it is known that the nearest triangle is being searched for, then once any intersecting triangle is found, all paths in the tree that are farther away than that triangle can be skipped; this is known as distance culling. If the nearest triangle happens to be found first, many branches of the tree may be skipped. Conversely, if the nearest triangle is not found until a large part of the tree has been searched, fewer triangles will be skipped (on distance grounds, at least). It is therefore preferable to search the tree in an order that finds the nearest triangle as early as possible. In embodiments of the present invention, child references may be pushed onto the stack in decreasing order of distance so that they pop off the stack in increasing order of distance.
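A minimal sketch of this ordering heuristic, with illustrative names: hit children farther than the current best hit are culled, and the survivors are pushed farthest-first so that `pop()` yields the nearest child.

```python
def push_hits_nearest_first(stack, child_hits, best_t):
    """child_hits: list of (node, entry_distance). Cull, then push far-to-near."""
    survivors = [(n, t) for n, t in child_hits if t < best_t]  # distance culling
    for node, _ in sorted(survivors, key=lambda nt: nt[1], reverse=True):
        stack.append(node)   # farthest pushed first ...
    return stack             # ... so stack.pop() yields the nearest child
```

Because pops proceed in increasing distance order, the nearest triangle tends to be found early, which maximizes the benefit of distance culling.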
While the use of acceleration structures reduces the complexity and time cost of testing every primitive against every ray, ray tracing systems often need to introduce dedicated state and data to manage the traversal of the acceleration structure. The state and data required for traversal management can place significant demands on compute and memory bandwidth.
One implementation of the acceleration structure traversal is to utilize a stack to store the state and data of nodes in the acceleration structure that have not been traversed for subsequent traversal of the acceleration structure.
The introduction of a specialized hardware manager to optimize the management of traversal of the acceleration structure may further improve the performance of computer graphics processing system ray tracing.
In order to further improve the performance of ray tracing, an embodiment of the present invention provides a ray stack manager for managing a ray stack and improving ray traversal efficiency, thereby realizing effective improvement of ray tracing performance.
As shown in FIG. 3, the ray stack manager according to the embodiment of the present invention includes a plurality of management units. Each management unit includes a stack storage module configured to store the stack of a corresponding ray and an access logic module configured to receive stack requests and manage the stack in the corresponding stack storage module according to each request, where each ray corresponds to one stack and each stack stores the traversal state of its corresponding ray. In an embodiment of the present invention, the ray stack manager stores the traversal state of each ray to carry out the search of the four-way tree.
In some embodiments, the ray stack manager further comprises memory interfaces multiplexed by the plurality of management units for data interaction with memory, wherein the data interaction comprises linefill and write-back. In an embodiment of the present invention, the ray stack manager includes 4 management units whose internal structures may be identical, so that 4 rays may be processed in parallel in one cycle. The 4 management units multiplex a plurality of memory interfaces through an arbiter for data interaction with an external memory. Specifically, these may include interfaces that send commands/data to memory (memory located outside the ray stack manager), namely linefill (Linefill) and/or write-back (Writeback) requests, and interfaces that receive commands/data from memory, namely linefill and/or write-back responses. It will be appreciated that the ray stack manager of embodiments of the present invention may be configured with different numbers of management units and memory interfaces depending on the application, and the invention is not limited in this regard.
In some embodiments, the stack storage module is a dual-port SRAM, and each stack entry includes a pointer to at least one node that the corresponding ray needs to traverse, the node being determined from the ray tracing acceleration structure. It will be appreciated that the invention does not restrict which acceleration structure is employed. In the embodiment of the invention, the stack storage module caches the stack block at the top of each stack.
In some embodiments, each of the stacks includes at least one stack block, each stack block including a plurality of entries, wherein the stack blocks in different stacks have the same size. Specifically, each ray stack is divided into a plurality of identically sized stack blocks, each stack block holding 4 consecutive stack entries; for example, with each entry being a 4-byte address, each stack block is 16 bytes. In the embodiment of the invention, the top stack block of each ray's stack is kept in the stack storage module in an available state, effectively avoiding delays or stalls caused by cache misses.
FIG. 4 is a schematic diagram of stack blocks according to an embodiment of the present invention. In the embodiment of the invention, taking 4 rays as an example, 4 stacks are constructed, one per ray; each stack is divided into at least one stack block, and each stack block has 4 entries. Specifically, the size of each stack is indicated by the arrows in FIG. 4, and the cached stack blocks are drawn with diagonal shading. For Ray 2, the stack size is an integer multiple of the stack block size. In some implementations, if the current stack block is empty, the stack block below the current one is cached in the stack storage module so that data is always available for a pop request, avoiding delays or stalls.
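The fixed-size block layout can be modeled with a little arithmetic. The 4-entry, 4-byte figures come from the description above; the helper names are illustrative, not the patent's terminology.

```python
ENTRIES_PER_BLOCK = 4
ENTRY_BYTES = 4
BLOCK_BYTES = ENTRIES_PER_BLOCK * ENTRY_BYTES   # 16 bytes per stack block

def block_index(stack_size_entries):
    """Index of the block holding the current top-of-stack entry."""
    return (stack_size_entries - 1) // ENTRIES_PER_BLOCK

def entries_in_top_block(stack_size_entries):
    """How many valid entries the cached top block holds (1..4)."""
    rem = stack_size_entries % ENTRIES_PER_BLOCK
    return ENTRIES_PER_BLOCK if rem == 0 else rem
```

For a stack of 9 entries, the top block is block 2 and holds a single entry, which is exactly the case where a pop would trigger a prefetch of the block below.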
In some embodiments, each access logic unit comprises a front-end processing module for receiving stack requests and classifying them according to the stack state in the stack storage module, and a plurality of processing modules for responding to the stack requests according to the front-end processing module's classification. In the embodiment of the invention, the front-end processing module receives stack requests sent by other modules, including pop requests and push requests, and classifies them. In an embodiment of the present invention, the front-end processing module can process one stack request per cycle. Specifically, a stack request can be classified into the following four cases:
Class 1: pop without linefill, i.e., data is popped from the stack and no linefill is needed.
Class 2: pop with linefill, i.e., data is popped from the stack and a linefill is required.
Class 3: write with eviction, i.e., data is written to the stack and the original data in the stack must be written back to memory.
Class 4: write without eviction, i.e., data is written to the stack and the original data in the stack need not be written back to memory.
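The four-way split can be sketched as a small classifier. The predicate arguments stand in for the front-end's checks against the cached block state and stack size information; this is an illustration, not the hardware logic.

```python
def classify(request, top_block_entries, block_overflow):
    """Map a stack request to class 1..4 as described above (illustrative)."""
    if request == "pop":
        # popping the last cached entry forces a linefill of the block below
        return 2 if top_block_entries == 1 else 1
    if request == "push":
        # an overflowing block must first be evicted (written back) to memory
        return 3 if block_overflow else 4
    raise ValueError("unknown stack request")
```

Each class then maps to one of the four dedicated processing modules described below.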
In some embodiments, each access logic unit includes 4 processing modules for correspondingly processing the 4-class stack requests. In the embodiment of the present invention, the 4 processing modules are further configured to update, through the arbiter, the stack blocks in the stack storage module, and generate a linefill request and/or a write-back request sent to the memory.
In the embodiment of the invention, when the stack request is a pop request and the currently cached stack block has more than one entry, the first processing module processes the pop request and pops the entry at the top of the requested stack block off the stack.
In the embodiment of the invention, when the stack request is a pop request and the currently cached stack block has only one entry, the second processing module processes the pop request: it pops the requested entry off the stack, generates a linefill request to the memory, and performs a prefetch operation on the stack block below the current one. Delays caused by waiting for data to be fetched from memory can thus be avoided. It will be appreciated that when the stack is empty, no prefetch operation is performed.
In the embodiment of the invention, when the stack request is a push request and the currently cached stack block overflows, the third processing module processes the request: it generates a write-back request to the memory, writes the data of the current stack block back to memory, and then updates the block. Specifically, the data in the current stack block is replaced with the data of the next stack block above it.
In the embodiment of the invention, when the stack request is a push request and the currently cached stack block is not overflowed, the fourth processing module processes the stack request and writes data into the current stack block for updating.
In some embodiments, each access logic unit further includes a stack size storage module configured to store the size of each stack, and the front-end processing module determines, from the stored stack size information, whether a stack block will overflow or underflow. In the embodiment of the invention, the stack size storage module stores the size of each stack, and the front-end processing module can classify stack requests using this size information. Specifically, the stack's size before the push/pop operation is compared with its size after the operation to determine whether the current stack block will overflow or underflow.
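Under the 4-entry block size given in the description, this before/after size comparison can be sketched as follows; the function names are illustrative.

```python
ENTRIES_PER_BLOCK = 4

def crosses_block_boundary(old_size, new_size):
    """True when the top entry moves into a different 4-entry block."""
    old_block = (old_size - 1) // ENTRIES_PER_BLOCK if old_size else -1
    new_block = (new_size - 1) // ENTRIES_PER_BLOCK if new_size else -1
    return old_block != new_block

def push_overflows(size):
    """Pushing overflows when the top block already holds 4 entries."""
    return crosses_block_boundary(size, size + 1)

def pop_underflows(size):
    """Popping underflows when it removes the block's last remaining entry."""
    return crosses_block_boundary(size, size - 1)
```

A push onto a 4-entry block overflows (class 3 territory), while a pop of a block's last entry underflows and triggers the prefetch of the block below.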
In some implementations, very large memory delays can cause stalls. If a push or pop goes to memory, the ray is nonetheless turned back to the ray scheduling module for its next test, so a stack entry may be selected for scheduling, node acquisition, intersection, and traversal before its update has completed. In some implementations, each access logic unit therefore further includes a line state storage module for storing the state of a stack block, in order to determine whether the stack block is waiting for a linefill. In the embodiment of the present invention, for example, if the state indicates that the current stack block is waiting for a linefill from memory, a pop request for that stack must wait until the linefill completes before being processed.
In some embodiments, the line state storage module is further configured to track a dirty bit for each stack block, and a stack block is written back to memory only when its dirty bit is set, where the dirty bit indicates that the data of the stack block has been overwritten. In the embodiment of the invention, when a fetched stack block is evicted without having been modified (i.e., it is not dirty), it does not need to be written back to memory, which improves data processing efficiency.
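A minimal sketch of this dirty-bit policy follows; the class and method names are assumptions made for illustration, and the block identifiers are opaque keys rather than real hardware addresses.

```python
# Hedged sketch: a stack block is written back to memory on eviction only
# if it was modified while cached. Clean blocks are simply dropped,
# saving write-back bandwidth. All names are illustrative.

class LineState:
    def __init__(self):
        self.dirty = {}       # block_id -> True if modified since fetch
        self.writebacks = []  # block_ids actually written back to memory

    def mark_written(self, block_id):
        # A push or in-place update sets the block's dirty bit.
        self.dirty[block_id] = True

    def evict(self, block_id):
        if self.dirty.pop(block_id, False):
            self.writebacks.append(block_id)
            return "writeback"
        return "drop"  # unmodified block: no memory traffic needed
```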
In some implementations, each access logic unit further includes a deferred request module for storing a stack request while the stack block corresponding to that request waits for a linefill; the deferred request module is further configured to stall the input interface of the ray stack manager when it is full. In an embodiment of the present invention, if the requested stack block is not ready, the stack request is placed in the deferred request module until the block becomes ready, so that the processing flow is not stalled. When the deferred request module is full, i.e., the number of stored deferred stack requests reaches a preset amount, the input interface is stalled to avoid stack processing errors.
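The deferred request module described above can be sketched as a bounded queue. This is a software illustration only; the capacity, dictionary-based request format, and method names are assumptions.

```python
from collections import deque

# Hedged sketch of the deferred request module: requests whose stack block
# is still waiting for a linefill are parked in a bounded queue; when the
# queue is full, the ray stack manager's input interface must stall.

class DeferredRequests:
    def __init__(self, capacity=8):
        self.q = deque()
        self.capacity = capacity

    def defer(self, request):
        # Returns False when full: the caller must stall the input interface.
        if len(self.q) >= self.capacity:
            return False
        self.q.append(request)
        return True

    def input_stalled(self):
        return len(self.q) >= self.capacity

    def replay(self, ready_block):
        # A linefill completed: release the requests targeting that block.
        ready = [r for r in self.q if r["block"] == ready_block]
        self.q = deque(r for r in self.q if r["block"] != ready_block)
        return ready
```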
In some embodiments, the plurality of processing modules are further configured to update, through the arbiter, stack blocks in the stack storage module, and to generate linefill requests and/or write-back requests to the memory. In the embodiment of the invention, after a stack request is classified, it is either processed by the corresponding one of the first to fourth processing modules or placed in the deferred request module. The first to fourth processing modules update stack blocks in the stack storage module through the arbiter, and generate a linefill request or a data eviction request to the memory when needed. In some implementations, the processing module is further configured to return the pop stack result.
In some embodiments, a linefill (read) response from memory eventually returns and enters the linefill module (e.g., a FIFO). The linefill must update the stack storage module and the line state storage module, and signal that any stack requests suspended on that stack block can now be processed. For an eviction (write) response, only the line state storage module needs to be updated.
With the ray stack manager provided by the embodiment of the invention, one stack is managed for each ray by a plurality of parallel processing units, each stack is divided into a plurality of stack blocks, and the top stack block of each ray's stack is kept in the stack storage module of its processing unit. Multiple rays can thus remain in flight in parallel, so the multiple stacks are managed effectively, the efficiency of ray traversal processing is improved, and ray tracing performance is optimized.
FIG. 5 is a schematic diagram of a ray tracing kernel according to an embodiment of the present invention. In some embodiments, the ray tracing kernel includes a ray stack manager as described above, and further includes: a ray scheduling module for generating a ray index and selecting, from a plurality of rays, the rays that enter the intersection test; a ray intersection module for performing ray-primitive bounding box and/or ray-polygon (e.g., triangle) intersection tests; and a ray traversal module for sending stack requests to the ray stack manager based on the results of the ray intersection module, in order to determine the next node that each ray passing the intersection test needs to traverse. The ray stack manager receives the stack requests and manages the stack state of each ray. Because the ray tracing kernel of the embodiment of the invention is provided with the ray stack manager, it manages the traversal states of a plurality of rays optimally and processes the rays in parallel, improving the efficiency of ray tracing as a whole.
In an embodiment of the present invention, the ray scheduling module of the ray tracing kernel (RT Core) receives data for a plurality of rays, where each ray waiting to be scheduled has an address associated with one or more nodes to be tested in the acceleration structure. For a new ray, this address defaults to the root node of the acceleration structure.
In some embodiments, the ray tracing kernel further comprises a ray data storage module for storing information about the plurality of rays during ray traversal, where the information comprises at least one of ray origin, ray direction, ray minimum distance, ray maximum distance, and the nearest primitive hit by the ray, or a combination thereof. Specifically, the ray tracing kernel initializes an on-chip ray data memory when new rays are received, and during ray traversal the ray data storage module stores the ray data locally. The data stored here includes, for example, the ray origin, ray direction, ray minimum distance, ray maximum distance, and information about the nearest intersected primitive (hit primitive). It will be appreciated that the ray data may include further information according to actual needs; the invention is not limited in this regard.
In some embodiments, the ray tracing kernel further includes a node data acquisition module configured to acquire node information of the acceleration structure from the memory hierarchy, and to send the ray index received from the ray scheduling module together with the node information to the ray intersection module. Specifically, the ray scheduling module advances the selected rays to the node data acquisition stage, where the acceleration structure node data is requested and received from the hierarchical memory by the node data acquisition module. Upon receiving the node data, the node data acquisition module sends the node data and the ray index to the ray intersection module, which advances to the ray intersection test phase.
In some implementations, the ray intersection module includes hardware units for performing ray-primitive bounding box and ray-polygon (e.g., triangle) intersection tests. In some embodiments, the ray intersection module further comprises a conversion unit for transforming rays from the world-space coordinate system to the model-space coordinate system, which is necessary to support geometric transformations in the acceleration structure. In the embodiment of the invention, the ray intersection module sends the intersection test and transformation calculation results to the ray traversal module.
In some embodiments, the ray traversal module is configured to send a pointer back to the ray scheduling module after determining the next node to be traversed by each ray that passes the intersection test. Specifically, the ray traversal module determines the next node to be examined for each ray, and the ray stack manager manages each ray's stack of addresses still to be examined, i.e., the ray's traversal state. Once the ray traversal module determines which nodes need to be tested next, the pointers are forwarded back to the ray scheduling module, where the rays wait to be scheduled again.
In the embodiment of the invention, child nodes are grouped in fours, yielding a quadtree structure. Up to four primitive bounding boxes are stored contiguously in memory and fetched and tested together. If a ray has just tested the primitive bounding boxes of a node in the acceleration structure, there are three possible outcomes:
1. No primitive bounding box is hit. The next node to be tested is popped off the stack.
2. Exactly one primitive bounding box is hit. The next node to be tested is the child node of the intersected node.
3. More than one primitive bounding box is hit. The next node to be tested is the child node of the nearest intersected node, determined by the hit distance returned by the primitive bounding box intersection tester. The remaining child node pointers are pushed onto the stack from furthest to closest.
If the ray passes through a transform node, there is only one child node; it is treated as a hit and forwarded to the ray scheduling module. If the ray has just performed a ray-triangle intersection, the address of the next node is popped from the stack.
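The three outcomes above can be sketched as follows. This is an illustrative software model under stated assumptions: `hits` is a list of (child_pointer, hit_distance) pairs for the intersected bounding boxes, and `stack` is a simple Python list standing in for the ray's hardware-managed stack.

```python
# Hedged sketch of the three traversal outcomes for a quadtree node test.
# All structures and names are illustrative, not the claimed hardware.

def next_node(hits, stack):
    if not hits:
        # Outcome 1: no box hit -> pop the next node (None if stack is empty).
        return stack.pop() if stack else None
    # Sort intersected children by hit distance: the nearest is visited first.
    hits = sorted(hits, key=lambda h: h[1])
    nearest, *rest = hits
    # Outcome 3: push the remaining children furthest-first, so the closest
    # of them is on top of the stack and popped next.
    for child, _ in reversed(rest):
        stack.append(child)
    # Outcome 2/3: continue with the (nearest) intersected child.
    return nearest[0]
```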
In some embodiments, the ray stack manager is responsible for managing these stacks and reducing the memory bandwidth resulting from accessing them.
In some embodiments, multiple rays can be in the ray tracing kernel's processing flow simultaneously; that is, at the same time there are rays in the ray scheduling stage, the ray intersection stage, and the ray traversal stage, which masks the latency caused by node fetches. In the embodiment of the invention, the processing of rays can also be parallelized, so that N rays are processed in parallel in each stage in every cycle.
In some implementations, stalls or delays in the data pipeline are avoided by optimizing memory accesses. For example, in case 3 above, when existing data must be evicted in order to push new data onto the stack, the evicted data can be placed in a write-back memory queue; since no data is waiting to be returned at that moment, the manager can continue to accept other stack requests while the write-back completes.
In some implementations, for pop stack requests, a linefill is performed when the current stack block becomes empty, to ensure that the complete stack block below it is available when it is next accessed. If the linefill were not performed in advance, a pop stack request could incur a long delay waiting for an on-demand linefill. In the embodiment of the invention, taking the quadtree as an example, up to three entries can be pushed in one cycle (the fourth entry is the nearest hit, which bypasses the ray stack manager and is forwarded directly to the ray scheduling module), but at most one entry is popped per cycle, so prefetching the lower stack block is the more reasonable choice.
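The pop-with-prefetch behavior can be sketched like this; the function signature, the list standing in for the cached block, and the callback are illustrative assumptions.

```python
# Hedged sketch of pop-with-prefetch: when a pop empties the cached top
# block, a linefill for the block below is issued immediately, so it is
# (ideally) already present when the next pop needs it. Names illustrative.

def pop_entry(cached_block, stack_depth, issue_linefill):
    """cached_block: entries of the current top block (top of stack last).
    stack_depth: total entries in the whole stack.
    issue_linefill: callback taking the remaining depth, which identifies
    the lower stack block to fetch."""
    entry = cached_block.pop()
    stack_depth -= 1
    if not cached_block and stack_depth > 0:
        # Prefetch the lower block now rather than stalling the next pop.
        issue_linefill(stack_depth)
    return entry, stack_depth
```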
In some embodiments, the arrangement in memory of the stack block data is optimized. One standard arrangement is: R rays, one stack per ray, N stack blocks per stack, and M entries per stack block. In C-like notation, the corresponding data layout is data[R][N][M].
However, when considering the access pattern, it is more likely that multiple rays access the same stack level at around the same time than that the same ray accesses the next level of its own stack. This is because one ray takes a long time to pass through the entire ray tracing kernel, and many other rays will be processed before the first ray cycles back. Because these stack blocks are fetched via cache lines of the hierarchical memory structure, and a cache line may be larger than one stack block, there is little benefit in one cache line containing stack blocks 0 to 3 (stack entries 0 to 15) of a single ray. In the embodiment of the invention, the stack blocks are therefore arranged horizontally across all rays, giving the data layout data[N][R][M]. As a result, the data held in the higher-level caches has a higher utilization rate, improving data processing efficiency.
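The two layouts can be contrasted with simple address calculations. This is an illustrative model: the function names and row-major flattening are assumptions used to show why the data[N][R][M] layout groups the same stack level of different rays into adjacent addresses.

```python
# Hedged sketch contrasting the two layouts in the text, with R rays,
# N blocks per stack, and M entries per block (row-major flattening).

def addr_per_ray(ray, block, entry, R, N, M):
    # data[R][N][M]: one ray's stack blocks are adjacent in memory.
    return (ray * N + block) * M + entry

def addr_per_level(ray, block, entry, R, N, M):
    # data[N][R][M]: the same stack level of different rays is adjacent,
    # so rays traversing the same level share cache lines.
    return (block * R + ray) * M + entry
```

With R=8, N=4, M=4, consecutive rays at level 0 sit only M entries apart under data[N][R][M], whereas under data[R][N][M] they are N*M entries apart, which is what makes the per-level layout cache-friendlier for this access pattern.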
FIG. 6 is a schematic diagram of a graphics processing system according to an embodiment of the present invention. The system includes a plurality of ray tracing kernels as described above and a plurality of shader cores (Shader Cores). The shader cores are configured to invoke ray tracing functions and send the initialization information required for ray traversal to the ray tracing kernels; the ray tracing kernels are configured to perform ray traversal according to the initialization information and return the results; and the shader cores are further configured to perform lighting calculations according to the returned results.
In some implementations, the GPU is composed of a plurality of shader cores responsible for executing the shader programs defined by the application, while the plurality of ray tracing kernels are responsible for performing ray tracing functions. In the embodiment of the present invention, the ray tracing kernels and the shader cores are in a 1:1 relationship; it will be appreciated that any ratio of mapping between ray tracing kernels and shader cores is possible, for example by means of a crossbar, and the invention is not limited in this regard.
In some existing APIs, rays are defined by an application executed on a shader core, which may call a TraceRay() function (or a similar function) to request that the hardware ray tracing core perform a ray tracing traversal over a specified acceleration structure. Referring to the architecture shown in fig. 6, the shader core executes a general application program that can call the TraceRay() function. At this point, the data for initializing the traversal (e.g., ray data, acceleration structure data, parameters) is sent from the shader core to the ray tracing core, and the traversal is then handled entirely by the ray tracing core. In an embodiment of the present invention, the TraceRay() function is a blocking call, so the shader core waits for the ray tracing core to complete the traversal and return the result (in the meantime, the shader core can execute other threads). Specifically, the returned result indicates whether the ray intersected a primitive, which intersected primitive was the nearest, and the distance to it. The shader core uses this information for lighting calculations. Because the ray tracing kernel is equipped with the ray stack manager described above and manages the ray traversal state optimally, the overall ray tracing efficiency is improved.
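The blocking handoff described above can be sketched in software as follows. This is not an actual graphics API: the function, the `HitResult` fields, and the `rt_core.traverse` interface are hypothetical names chosen only to illustrate the shader-side call pattern.

```python
from dataclasses import dataclass
from typing import Optional

# Hedged sketch of a blocking TraceRay()-style handoff: the shader side
# sends initialization data to the ray tracing kernel, blocks until the
# traversal completes, then uses the hit result for shading.

@dataclass
class HitResult:
    hit: bool
    primitive_id: Optional[int]  # nearest intersected primitive, if any
    distance: Optional[float]    # distance to that primitive

def trace_ray(rt_core, origin, direction, t_min, t_max) -> HitResult:
    # Blocking call: the shader "thread" waits here while the RT core
    # performs the full acceleration-structure traversal.
    return rt_core.traverse(origin, direction, t_min, t_max)
```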
As shown in fig. 7, corresponding to the above-mentioned graphics processing system, the embodiment of the present invention further provides a graphics processing method, where the method includes:
Step 701: maintain a stack for each ray to be traversed, using the ray stack manager in a ray tracing kernel as described above, to store the traversal state information of each ray. In some implementations, one stack is created for each ray to be traversed, each stack is divided into a plurality of stack blocks, and each stack block has a plurality of entries. In the embodiment of the invention, a plurality of rays can be processed in parallel in the same cycle, the rays can be in different stages, and the top stack block of each ray is cached.
Step 703: determine the next node to be traversed by each ray according to the stack requests and the traversal state information, until the traversal of all rays is complete and the traversal results are determined. In embodiments of the present invention, the stack of each ray is managed according to the stack state and the stack requests, and each stack request is responded to, e.g., pop, push, linefill, or stack block data update. Specifically, the stack blocks are arranged horizontally across the rays, so that for a given level the memory addresses of that stack block are contiguous across all rays; within one stack, by contrast, the blocks are arranged vertically, so the memory addresses of the stack blocks belonging to a single stack are not contiguous.
In some embodiments, when a cached stack block becomes empty, a prefetch operation is performed on its next stack block.
In some implementations, the state of the stack block is tracked, and memory is not written back for unmodified stack blocks.
In some implementations, stack requests that are still waiting for linefills are deferred to avoid data processing flow pauses.
Step 705: return the traversal results to the shader cores for subsequent rendering. In some implementations, tiles are rendered according to a generated tile display list to obtain the rendered image.
For a more specific implementation of each step of the graphics processing method of the present invention, reference may be made to the description of the graphics processing system of the present invention; the steps have similar advantageous effects, which are not repeated here.
Fig. 8 is a block diagram of an electronic device according to an embodiment of the application. The embodiment of the application also provides an electronic device, as shown in fig. 8, including at least one processor 801 and a memory 803 communicatively connected to the at least one processor 801. The memory 803 stores instructions executable by the at least one processor 801; when the processor 801 executes the instructions, the graphics processing method of the above embodiment is implemented. The number of memories 803 and processors 801 may each be one or more. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the application described and/or claimed herein.
The electronic device may also include a communication interface 805 for communicating with external devices for interactive data transmission. The various devices are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor 801 may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a graphical user interface (GUI) on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Likewise, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 8, but this does not mean that there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 803, the processor 801, and the communication interface 805 are integrated on a chip, the memory 803, the processor 801, and the communication interface 805 may complete communication with each other through internal interfaces.
It should be appreciated that the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor. It is noted that the processor may be a processor supporting the Advanced RISC Machines (ARM) architecture.
An embodiment of the present application provides a computer-readable storage medium (such as the memory 803 described above) storing computer instructions that when executed by a processor implement the method provided in the embodiment of the present application.
Alternatively, the memory 803 may include a storage program area that may store an operating system, an application program required for at least one function, and a storage data area that may store data created according to the use of the electronic device of the graphic processing method, etc. In addition, memory 803 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 803 optionally includes memory located remotely from processor 801 which may be connected to the graphics processing method electronics via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Any process or method description in a flowchart or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes additional implementations in which functions may be performed substantially simultaneously or in the reverse of the order shown or discussed, depending on the functions involved.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the embodiments described above may be performed by a program that, when executed, comprises one or a combination of the steps of the method embodiments, instructs the associated hardware to perform the method.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules described above, if implemented in the form of software functional modules and sold or used as a stand-alone product, may also be stored in a computer-readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that various changes and substitutions are possible within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (25)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310649821.7A CN119068099A (en) | 2023-05-31 | 2023-05-31 | A light stack manager, graphics processing method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119068099A true CN119068099A (en) | 2024-12-03 |
Family
ID=93632462
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310649821.7A Pending CN119068099A (en) | 2023-05-31 | 2023-05-31 | A light stack manager, graphics processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119068099A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||