Disclosure of Invention
The invention aims to overcome the defect of redundant computation in prior-art accelerators and provides a priority-aware processing method for graph computation and a corresponding hardware accelerator.
The aim of the invention can be achieved by the following technical scheme:
A priority-aware processing method for graph computation comprises the following steps:
receiving graph data to be processed, traversing each vertex in the graph data, predicting the priority of all adjacent edges of each vertex, and reordering all adjacent edges according to the priority prediction result;
storing the adjacent edges of each vertex hierarchically according to the reordering result;
scheduling a plurality of vertices simultaneously for parallel computation; for each vertex, scheduling adjacent edges in order, starting from the high-priority storage area, for task processing; if the vertex state obtained by the processing converges to the final result, terminating the processing task of the current vertex and skipping all remaining untraversed adjacent edges of the current vertex.
Further, the reordering process comprises: obtaining all adjacent edges and adjacent vertices of a vertex, and calculating the priority of each adjacent edge from the weight of the edge and the degree of the corresponding adjacent vertex;
traversing all adjacent edges of the vertex and rearranging them in descending order of priority to obtain a reordered adjacency list.
Further, the hierarchical storage process comprises: selecting, according to a preset boundary value K, the first K adjacent edges from the reordering result of each vertex and storing them in a high-priority edge area, and storing the remaining adjacent edges in a low-priority edge area;
if the number of adjacent edges of a vertex is smaller than the boundary value K, filling the corresponding high-priority edge area with placeholders.
Further, during the task processing of the adjacent edges of a vertex, the high-priority edge area corresponding to the vertex is accessed first, and all adjacent edges in the high-priority edge area are traversed to perform computation tasks and update the vertex state; if the vertex state converges to the final result, the processing task of the current vertex is terminated and the remaining low-priority edge area is skipped; if the vertex state does not converge to the final result, the adjacent edges in the low-priority edge area wait to be scheduled for access and computation.
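The processing rule described above can be illustrated with the following minimal sketch; the helper names process_edge and has_converged and the list representation of the high-priority edge area (with None as placeholder padding) are assumptions for illustration, not part of the claimed hardware.

```python
def process_high_priority(state, high_priority_edges, process_edge, has_converged):
    """Process the high-priority edge area of one vertex first.

    high_priority_edges holds the first K edges (None marks placeholder padding).
    process_edge(state, edge) returns the updated vertex state;
    has_converged(state) tests convergence to the final result.
    Both helpers are illustrative assumptions, not part of the invention's API.
    """
    for edge in high_priority_edges:
        if edge is None:              # placeholder padding: no more real edges
            break
        state = process_edge(state, edge)
        if has_converged(state):
            return state, False       # converged: skip the low-priority area entirely
    return state, True                # not converged: low-priority edges await scheduling
```

A vertex for which the second return value is True would later be scheduled again to access its low-priority edge area, as described above.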
The invention also provides a hardware accelerator for graph computation, which comprises a processor, a scheduler, a hot-value manager, a memory-efficiency-aware combiner, an on-chip interconnection network, a high-speed internal memory, and an external memory;
the external memory is used for storing the graph structure information and the edge data required by the processor;
The processor is used for processing graph computation tasks in parallel with a plurality of basic processing units; each basic processing unit is allocated a corresponding storage space in the high-speed internal memory for storing vertex data, reads vertex data from its storage space, performs priority prediction on all adjacent edges of a vertex, and executes the corresponding graph computation task according to the instructions of the scheduler;
The high-speed internal memory is used for allocating storage space to each basic processing unit and for storing vertex values, intermediate computation results, and edge-offset memory addresses;
The hot-value manager is used for reading corresponding data from the high-speed internal memory according to the received high-degree vertices, realizing independent management and computation of the high-degree vertices;
The scheduler is used for allocating a high-priority edge area and a low-priority edge area to each processing unit, storing adjacent edges hierarchically according to their priority, and scheduling vertex tasks preferentially from the high-priority edge area;
The on-chip interconnection network is used for data interaction and communication between different processing units in the processor;
The memory-efficiency-aware combiner is configured to read edge data from the external memory and transmit the edge data to the processor.
Further, the processing procedure of the scheduler specifically includes:
The scheduler allocates a high-priority edge area, a low-priority edge area, a high-priority buffer, and a low-priority buffer for each processing unit, reorders all adjacent edges according to the priority prediction result of each processing unit, and stores the reordered edges hierarchically in the high-priority edge area and the low-priority edge area;
receiving a vertex ID transmitted by the processor, checking whether the vertex ID is being scheduled for the first time in the current iteration round, and if so, allocating a unique, fixed global workload ID to the vertex and storing it in the corresponding high-priority buffer;
in the processing stage, sequentially reading vertex tasks to be processed from the high-priority buffer in first-in first-out order, reading the corresponding high-priority edges from the high-priority edge area according to the current vertex ID, and sending them to the corresponding processing units to execute graph computation tasks;
sequentially scheduling vertices from the low-priority buffer, reading the memory address offsets of the low-priority edges in the corresponding low-priority edge area according to the vertex ID, collecting the memory address offsets of a plurality of low-priority edges during processing, packaging the requests and sending them to the combiner, and accessing the low-priority edges stored in the external memory uniformly through the combiner.
Further, the processing procedure of the combiner specifically includes:
Selecting the request with the smallest memory address, using that request's address plus one cache-line size as an address threshold, traversing all received request addresses, and, if a request address is smaller than the address threshold, merging that request with the smallest-address request before accessing the external memory.
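The address-threshold merging rule can be illustrated with the following sketch; the cache-line size of 64 bytes and the plain-integer address representation are illustrative assumptions.

```python
CACHE_LINE_SIZE = 64  # bytes; an assumed value, chosen only for illustration

def merge_requests(addresses):
    """Merge edge requests that fall within one cache line of the smallest
    requested address; the merged group is served by a single external access,
    the rest are left for a later round."""
    if not addresses:
        return [], []
    base = min(addresses)
    threshold = base + CACHE_LINE_SIZE
    merged = [a for a in addresses if a < threshold]
    remaining = [a for a in addresses if a >= threshold]
    return merged, remaining
```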
Further, the high-priority edge area is a storage area composed of high-speed read-write hardware storage devices, and the low-priority edge area is a storage area composed of low-speed read-write hardware storage devices.
Further, the processing procedure of the hot-value manager specifically comprises the following steps:
according to the vertex IDs of the high-degree vertices transmitted by the processor, reading the corresponding vertex attribute values from the high-speed internal memory and storing them as a backup;
after each round of computation starts, the data corresponding to the high-degree vertices are sent to the basic processing units to perform vertex update tasks; coarse-grained data synchronization is performed after the high-degree vertices are updated, and the corresponding high-degree vertex data are backed up; whenever a subsequent vertex update task is generated, the hot-value manager checks whether the data to be read by the task hit the data stored in the hot-value manager; if so, the high-degree vertex data are returned directly to the basic processing unit; otherwise, the request is forwarded to the network-on-chip interface to be routed to access the corresponding data.
Further, the processor divides the basic processing units into coarse-grained Tile processing groups; the basic processing units of each Tile processing group share the backup data in the hot-value manager, and data are not shared between different Tile processing groups.
Compared with the prior art, the invention has the following advantages:
(1) The invention uses a priority-aware edge processing mode and greatly improves the throughput of graph computation by processing the edge tasks of multiple vertices in parallel. Compared with the prior art, by introducing a vertex-granularity edge-serial execution mode, the edge tasks that contribute most to the final result are processed first, reducing unnecessary computation load. By ranking edge tasks by priority, computing resources are concentrated on the most critical edge tasks, avoiding unnecessary computation waste.
(2) The invention adopts an efficient hierarchical memory layout and stores high-priority edges and low-priority edges separately. This layout enables the system to fetch high-priority edge data first when accessing memory, reducing redundant data access within cache lines and improving memory access efficiency.
(3) The invention proposes a hot-value manager to alleviate blocking in the network-on-chip. The manager manages frequently accessed vertex data through a distributed cache, reducing contention for this data in the network-on-chip and improving parallel processing efficiency. This design ensures that data are efficiently distributed and accessed among multiple processing units, reducing pipeline stalls and delays caused by data contention.
(4) To further optimize memory access efficiency, the invention introduces a memory-efficiency-aware combiner. The combiner dynamically merges memory requests for low-priority edges, significantly reducing memory bandwidth consumption. By reducing frequent memory accesses, the performance and energy efficiency of the overall system are improved.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
It is to be understood that these terms may be variously interpreted or substituted by one of ordinary skill in the art based on the description of the present invention. For example, in the present invention, a "module" or a "unit" may be a component implemented in hardware, or a functional module implemented in software or firmware; similarly, a "connection" or a "coupling" may be a physical connection, or a logical or functional connection. Unless specifically stated otherwise, these terms should be interpreted in the context of the overall understanding of the invention and should not be limited to particular fixed definitions.
The following terms related to the present invention are first introduced and explained.
The Intra-Edge-Sequential Model (IESM) is an edge-serial execution mode proposed by the present invention. Its specific behavior is to schedule one edge of each of a plurality of vertices at the same time, so that the computation tasks of the plurality of vertices are executed simultaneously;
The Hot-Value Manager (HVM) is a component designed by the invention to independently store and centrally manage a number of frequently accessed vertex data, thereby reducing the data contention and pipeline stalls caused by data access conflicts between different processing units;
On-chip cache refers to an internal cache memory integrated in a processor or dedicated hardware accelerator. In the present invention it refers to the scratchpad memory (SPM) in the FPGA. The SPM is a small, high-speed memory for storing temporary data; it can be directly connected to logic units and provides high-bandwidth, low-latency data access. It is used to store the instructions and data currently needed by the processor and reduces the latency of accessing the external memory;
The Processing Element (PE) is the minimum basic processing unit in the invention, an independent module that performs basic computations. In graph computation applications it typically performs the computations that occur during vertex updates, such as addition, multiplication, and comparison. Each PE can independently process data computation tasks; the invention adopts a multi-PE parallel architecture so that multiple computation tasks can be processed simultaneously, accelerating the execution of graph computation;
A Tile is a coarse-grained group of multiple PEs. The PEs in each Tile share the same communication channel to the external memory and a memory region in the same block of external memory. This partitioning improves data reuse and locality;
The Network-on-Chip (NoC) is an interconnection architecture for communication between different modules. A NoC is designed to relieve the communication bottlenecks between components within a System-on-Chip (SoC) by providing an efficient, scalable, low-latency data transfer solution. In the invention, the NoC connects different PEs to realize data interaction between them;
HyperX is a high-performance network topology commonly used in massively parallel computing systems and data center networks. It is an improved multi-stage network topology that aims to provide high bandwidth, low latency, and high fault tolerance while reducing network diameter and maintaining scalability. The NoC implementation in the present invention adopts the HyperX architecture.
Example 1
As shown in fig. 1, the present embodiment provides a priority-aware processing method for graph computation, which includes the following steps:
S1, adopting vertex-level parallel granularity, scheduling multiple vertices simultaneously, with each vertex scheduling only one adjacent edge at a time for task processing. When the vertex state converges to the final result, the computation task of the current vertex is stopped immediately and all remaining untraversed edges are skipped, reducing redundant, invalid computation;
The specific process of scheduling is as follows:
Constructing the graph data storage format in a preprocessing stage, evenly allocating fixed, mutually orthogonal vertex data to each processing unit, and storing the vertex data in the local cache of the processing unit;
the processing unit schedules an active vertex from its own vertex data cache in first-in first-out order and reads one edge of the active vertex at a time; multiple processing units perform the edge computation tasks of different vertices simultaneously;
the processing unit accesses, through the network-on-chip, the values of adjacent vertices that require remote data access;
when a vertex completes one edge computation task, if the vertex has converged to the final result, its traversal is terminated, the remaining edges are skipped, and the result is written back.
S2, predicting a priority for every edge of each vertex through a prediction function and reordering all edges according to priority, so that edges more likely to contribute to the vertex update are processed first, accelerating vertex convergence and allowing more edges to be skipped;
the specific process of ordering is as follows:
traversing each vertex in the preprocessing stage, traversing all adjacent edges and connected adjacent vertices of each vertex, and calculating the priority of each edge according to the weight of the edge and the degree of the adjacent vertex connected by the edge;
For each vertex, all its edges are traversed, and the entire adjacency list is rearranged in descending order of priority of each edge to form a new adjacency list.
S3, hierarchically storing all edges of each vertex, reducing the edge-data redundancy in cache lines caused by coarse-grained memory access;
The specific mode of hierarchical storage is as follows:
when reordering the adjacency list, selecting the first K edges of each vertex according to the preset boundary value K and the priority values to form the high-priority edge area, with the remaining edges forming the low-priority edge area;
for vertices whose number of adjacent edges is less than K, placeholders are filled into the corresponding high-priority area so that the high-priority edge areas of all vertices form a regular layout in the memory space.
During the task processing of the adjacent edges of a vertex, the high-priority edge area corresponding to the vertex is accessed first and all its adjacent edges are traversed to perform computation tasks and update the vertex state; if the vertex state converges to the final result, the processing task of the current vertex is terminated and the remaining low-priority edge area is skipped; if the vertex state does not converge to the final result, the vertex is sent to the active vertex set to wait to be scheduled again to access the low-priority edge area.
The specific procedure of the above graph computation execution mode is described in detail below.
Step 1, constructing the graph data storage format, evenly allocating fixed, mutually orthogonal vertex data to each processing unit, and storing the vertex data in the local cache of the processing unit. Mutually orthogonal means that the data of any two vertices are mutually independent and uncorrelated.
Specifically, for the construction of the graph data storage format, the graph data are stored according to the algorithm's requirements; preferably, the CSR or CSC format is used to reduce the read/write burden. For vertex allocation among n PEs (PE0 to PEn-1), a modulo-n operation is performed on each vertex index and the vertex is assigned to the PE whose index equals the remainder; for example, if the remainder is 3, the vertex is assigned to the PE numbered 3. The vertex data are finally stored in the local cache of that PE.
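A minimal sketch of this modulo-based allocation is given below; the function name and the use of plain Python lists as local caches are illustrative assumptions.

```python
def assign_vertices_to_pes(num_vertices, num_pes):
    """Assign each vertex ID to a PE by taking the ID modulo the number of PEs,
    so that every PE holds a fixed, disjoint subset of the vertices."""
    local_caches = {pe: [] for pe in range(num_pes)}
    for vertex_id in range(num_vertices):
        local_caches[vertex_id % num_pes].append(vertex_id)
    return local_caches

# Example: with 4 PEs, vertex 7 is assigned to PE 3 because 7 % 4 == 3.
```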
Step 2, the processing unit schedules an active vertex from its vertex data cache and reads one edge of the active vertex at a time, and multiple processing units perform edge computation tasks in parallel;
Specifically, the PE reads an active vertex v from its local cache in first-in first-out order, and reads the required edges from the hierarchical storage according to the edge computation progress of the active vertex, i.e., whether the edges to be read are high priority and in which buffer they reside. All PEs read edges in parallel and perform edge computation tasks; for example, BFS needs to compare the values of the adjacent vertex and the destination vertex.
Step 3, the processing unit accesses, through the network-on-chip, the values of adjacent vertices that require remote data access;
Specifically, the processing unit sends a read request for the adjacent vertex values through the HyperX interconnection network and receives the data; after receiving the data, the processing unit starts the computation task. The interconnection network determines the destination by performing a modulo operation on the index of the adjacent vertex in a read request, or of the active vertex in returned data. Preferably, latency is reduced by issuing tasks in advance.
Step 4, judging whether each vertex has converged, writing back converged vertices, and returning non-converged vertices to the active vertex set;
Specifically, when a vertex completes one edge computation task, whether the vertex has converged is judged; for the BFS algorithm under a synchronous execution mode, having been updated already indicates convergence. If the vertex has converged to the final result, its traversal is terminated, all remaining unnecessary edges are skipped, and the latest result is written back into the local cache of the processing unit that stores its vertex value. If the vertex has not converged to the final result, the edges already completed for the vertex are recorded and the vertex is written back to the active vertex set.
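A minimal sketch of this convergence test for a BFS-style computation is given below, assuming a dict-based local cache and a list-based active set; it is an illustration of the rule, not the hardware implementation.

```python
def update_after_edge(vertex_id, new_level, local_cache, active_set):
    """After one edge computation, test convergence for a BFS-style vertex.

    Under the synchronous execution mode described above, a vertex counts as
    converged as soon as its value has been updated in this iteration.
    local_cache: dict vertex ID -> current level; active_set: list of pending IDs.
    """
    old_level = local_cache.get(vertex_id, float("inf"))
    if new_level < old_level:
        local_cache[vertex_id] = new_level   # write back the latest result
        return True                          # converged: remaining edges are skipped
    active_set.append(vertex_id)             # not converged: return to the active set
    return False
```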
Step 5, traversing each vertex, traversing all adjacent edges and connected adjacent vertices of each vertex, and calculating the priority of each edge according to the weight of the edge and the degree of the adjacent vertex connected by the edge;
Specifically, for each edge e of each vertex v, its priority value is calculated using formula (1).
Priority(e) = Deg(u) + λ × Wei(e)    (1)
where u is the adjacent vertex of v connected by edge e, Deg(u) denotes the degree of vertex u, i.e., the sum of its out-degree and in-degree, Wei(e) is the weight of edge e (0 in an unweighted graph), and λ is a scaling factor that can be adjusted according to the graph or algorithm.
Specifically, for the adjacency list N(v) of each vertex v, N(v) is traversed and the priority value of each edge e is calculated and stored using formula (1).
Step 6, traversing all edges of each vertex, and rearranging in descending order according to the priority of each edge;
Specifically, for the adjacency list N(v) of each vertex v, N(v) is traversed, the priority value calculated in step 5 is read for each edge e, and the edges in N(v) are sorted in descending order of priority; note that only the edges within N(v) are sorted, and edges belonging to different vertices are not sorted against each other. After sorting, the list is stored in the hierarchical storage structure.
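Steps 5 and 6 together can be sketched in software as follows; the dictionary-based adjacency-list and degree representations are assumptions chosen only for illustration.

```python
def reorder_adjacency_list(adj, degree, lam=1.0):
    """Compute Priority(e) = Deg(u) + lambda * Wei(e) for every edge and sort
    each vertex's adjacency list N(v) in descending order of priority.

    adj    : dict vertex v -> list of (neighbor u, weight) pairs
    degree : dict vertex -> degree (in-degree + out-degree)
    lam    : the scaling factor lambda, adjustable per graph or algorithm
    """
    reordered = {}
    for v, edges in adj.items():
        scored = [(degree[u] + lam * wei, u, wei) for (u, wei) in edges]
        # Only edges inside N(v) are compared; edges of different vertices
        # are never sorted against each other.
        scored.sort(key=lambda t: t[0], reverse=True)
        reordered[v] = [(u, wei) for (_, u, wei) in scored]
    return reordered
```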
Step 7, selecting, from the adjacency list of each vertex, the K edges with the highest priority and storing them in the high-priority storage area;
Specifically, as shown in fig. 3, the adjacency list of each vertex has already been reordered so that edges with smaller indices have higher priority; it therefore suffices to take the first K elements, starting from index 0, of each vertex's adjacency list. The value of K is preset by the user and is typically a small integer.
Specifically, the high-priority storage area refers to a storage area formed by high-speed read-write hardware storage devices, and corresponding descriptions are provided in the hardware implementation step.
Preferably, for any vertex u whose adjacency list has fewer than K edges, K elements are still stored in the high-priority cache to facilitate indexing and burst reads of high-priority edges; the first elements are the actual edges of vertex u and the remaining elements are blank placeholders.
Step 8, adding the remaining edges of each vertex's adjacency list from step 7 into the low-priority storage area;
Specifically, since the adjacency list of each vertex has already been reordered, it is only necessary to take all elements from the (K+1)-th element to the end of the list.
Specifically, the low-priority storage area refers to a storage area formed by a low-speed read-write hardware storage device, and the low-priority storage area is correspondingly described in the hardware implementation step.
Specifically, if all adjacent edges of a vertex were already added to the high-priority storage area in step 7, this step requires no action.
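Steps 7 and 8 can be sketched together as follows, with None standing in for the blank placeholder; the dictionary representation of the reordered adjacency lists is an illustrative assumption.

```python
def split_by_priority(reordered_adj, k):
    """Split each reordered adjacency list into a fixed-size high-priority
    region of exactly K entries (padded with None placeholders) and a
    low-priority region holding the remaining edges."""
    high, low = {}, {}
    for v, edges in reordered_adj.items():
        top_k = list(edges[:k])
        # Pad so every high-priority region has exactly K entries, giving a
        # regular memory layout suitable for indexing and burst reads.
        top_k += [None] * (k - len(top_k))
        high[v] = top_k
        low[v] = list(edges[k:])   # empty if the vertex has K edges or fewer
    return high, low
```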
Example 2
As shown in fig. 4, the present embodiment provides a hardware accelerator for graph computation, which comprises a processor, a scheduler, a hot-value manager, a memory-efficiency-aware combiner, an on-chip interconnection network, a high-speed internal memory, and an external memory;
the external memory is used for storing the graph structure information and the edge data required by the processor;
The processor is used for processing graph computation tasks in parallel with a plurality of basic processing units; each basic processing unit is allocated a corresponding storage space in the high-speed internal memory for storing vertex data, reads vertex data from its storage space, performs priority prediction on all adjacent edges of a vertex, and executes the corresponding graph computation task according to the instructions of the scheduler;
The high-speed internal memory is used for allocating storage space to each basic processing unit and for storing vertex values, intermediate computation results, and edge-offset memory addresses;
The hot-value manager is used for reading corresponding data from the high-speed internal memory according to the received high-degree vertices, realizing independent management and computation of the high-degree vertices;
The scheduler is used for allocating a high-priority edge area and a low-priority edge area to each processing unit, storing adjacent edges hierarchically according to their priority, and scheduling vertex tasks preferentially from the high-priority edge area;
The on-chip interconnection network is used for data interaction and communication between different processing units in the processor;
The memory-efficiency-aware combiner is used to read edge data from the external memory and transmit the edge data to the processor.
The processing procedure of the scheduler specifically comprises the following steps:
The scheduler allocates a high-priority edge area, a low-priority edge area, a high-priority buffer, and a low-priority buffer for each processing unit, reorders all adjacent edges according to the priority prediction result of each processing unit, and stores the reordered edges hierarchically in the high-priority edge area and the low-priority edge area;
receiving a vertex ID transmitted by the processor, checking whether the vertex ID is being scheduled for the first time in the current iteration round, and if so, allocating a unique, fixed global workload ID to the vertex and storing it in the corresponding high-priority buffer;
in the processing stage, sequentially reading vertex tasks to be processed from the high-priority buffer in first-in first-out order, reading the corresponding high-priority edges from the high-priority edge area according to the current vertex ID, and sending them to the corresponding processing units to execute graph computation tasks;
sequentially scheduling vertices from the low-priority buffer, reading the memory address offsets of the low-priority edges in the corresponding low-priority edge area according to the vertex ID, collecting the memory address offsets of a plurality of low-priority edges during processing, packaging the requests and sending them to the combiner, and accessing the low-priority edges stored in the external memory uniformly through the combiner.
The processing procedure of the combiner is specifically as follows:
Selecting the request with the smallest memory address, using that request's address plus one cache-line size as an address threshold, traversing all received request addresses, and, if a request address is smaller than the address threshold, merging that request with the smallest-address request before accessing the external memory.
The high-priority edge area is a storage area composed of high-speed read-write hardware storage devices, and the low-priority edge area is a storage area composed of low-speed read-write hardware storage devices.
The processing procedure of the hot-value manager is as follows:
according to the vertex IDs of the high-degree vertices transmitted by the processor, reading the corresponding vertex attribute values from the high-speed internal memory and storing them as a backup;
after each round of computation starts, the data corresponding to the high-degree vertices are sent to the basic processing units to perform vertex update tasks; coarse-grained data synchronization is performed after the high-degree vertices are updated, and the corresponding high-degree vertex data are backed up; whenever a subsequent vertex update task is generated, the hot-value manager checks whether the data to be read by the task hit the data stored in the hot-value manager; if so, the high-degree vertex data are returned directly to the basic processing unit; otherwise, the request is forwarded to the network-on-chip interface to be routed to access the corresponding data.
Optionally, the processor divides the basic processing units into coarse-grained Tile processing groups; the basic processing units of each Tile processing group share the backup data in the hot-value manager, and data are not shared between different Tile processing groups.
To facilitate an understanding of the structure and workflow of the accelerator, its execution is now described. Specifically, the complete flow of a graph computation algorithm executed by the hardware accelerator may include the following steps:
Step 1, allocating a contiguous storage space of equal size for each PE in the high-speed internal memory. After the graph data are acquired, a read request is first sent to the external memory and all vertex data of the graph are read. Each vertex is then assigned to a specific PE by taking the vertex index modulo the number of PEs. Assuming N vertices and M PEs in total, each PE holds N/M vertices. Each PE stores its assigned vertex data in its fixed memory space in the high-speed internal memory.
Step 2, each PE reads vertex data from its local cache and traverses each vertex in turn. For a vertex v, the address offset of the first edge in its adjacency list is read from the internal cache according to the ID of v. According to the offset address, a request is sent to the external memory and all edges of vertex v are read into the current PE. Each edge of v is traversed in turn, and the weight of the edge and the IDs of the adjacent vertices of v are read from the internal memory according to the edge data. The degree of each adjacent vertex is read from the internal memory based on its ID. A logic computation unit in the PE calculates the priority value of the edge with the priority prediction function, based on the edge weight and the degree of the adjacent vertex. After the priorities of all edges of v have been calculated, the edges are reordered in descending order of priority.
Step 3, selecting the high-degree vertices and moving them to the hot-value manager for centralized storage and management. Specifically, all processing units traverse each vertex in their local caches, read the degree of each vertex from the high-speed internal memory according to the vertex ID, and apply the following formula:
where N_v represents the number of vertices in the graph, MaxDegree represents the maximum vertex degree, Degree(v_i) represents the degree of vertex v_i, τ represents the top τ% of vertex degrees desired to be obtained, and τ_η represents the calculated growth-rate factor. The screening threshold for high-degree vertices is obtained as T_η = MaxDegree × (1 - τ_η%). The processing units send all vertex IDs in their local caches whose degree exceeds T_η to the hot-value manager. When the hot-value manager receives these vertex IDs, it reads the corresponding vertex attribute values from the high-speed internal memory and stores them in the memory space allocated to the hot-value manager in the high-speed internal memory. In the processor, every 16 processing units form a Tile. The hot-value manager replicates the backup of the high-degree vertex data according to the number of Tiles; the 16 PEs in each Tile share the high-degree vertex backup in the same hot-value manager, and different Tiles do not share backups.
Step 4, after each round of iteration starts, the hot-value manager first sends the high-degree vertices to the processing units for vertex update tasks. After the high-degree vertices have been updated, coarse-grained data synchronization is performed and one copy of the high-degree vertex data is backed up for each Tile. In the subsequent computation, whenever a vertex update task is generated, the task is first sent to the hot-value manager. The manager first checks whether the data to be read by the task hit its high-degree vertex cache; if so, the high-degree vertex data are returned directly to the processing unit. If there is no hit, the request is forwarded to the network-on-chip interface to be routed to the remote processing unit.
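The selection and hit-check behavior of steps 3 and 4 can be illustrated with the sketch below. It assumes the threshold T_η = MaxDegree × (1 - τ_η%) with τ_η already computed, and dictionary-based stand-ins for the high-speed internal memory and the NoC forwarding path; these names are assumptions, not the hardware interfaces.

```python
class HotValueManager:
    """Simplified single-Tile model of the hot-value manager."""

    def __init__(self, internal_memory):
        self.internal_memory = internal_memory   # stands in for the high-speed internal memory
        self.hot_cache = {}                      # backup of high-degree vertex values

    def select_high_degree(self, degrees, max_degree, tau_eta):
        """Back up every vertex whose degree exceeds T_eta = MaxDegree * (1 - tau_eta%)."""
        threshold = max_degree * (1 - tau_eta / 100.0)
        for vertex_id, deg in degrees.items():
            if deg > threshold:
                self.hot_cache[vertex_id] = self.internal_memory[vertex_id]

    def read(self, vertex_id, forward_to_noc):
        """Return backed-up data on a hit; otherwise forward to the NoC interface."""
        if vertex_id in self.hot_cache:
            return self.hot_cache[vertex_id]
        return forward_to_noc(vertex_id)
```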
Step 5, when the algorithm starts to execute, the processor schedules each active vertex in turn from the active vertex queue and sends its ID to the scheduler. When the scheduler receives the active vertex ID, it checks whether the vertex is being scheduled for the first time in the current iteration round. If so, a unique, fixed global workload ID is allocated to the vertex and sent to the high-priority buffer. In the processing stage, the scheduler sequentially reads vertex tasks from the high-priority buffer in first-in first-out order, reads the high-priority edges corresponding to the vertex from the high-priority edge area according to the current vertex ID, and sends them to the processing unit. After the processing unit receives the edge data, it traverses all high-priority edges of the vertex in turn and performs the corresponding computation tasks to update the value of the target vertex. If, during the computation, the data of the target vertex (i.e., the adjacent vertex) of the active vertex is not held by the processing unit to which the current active vertex belongs, the processing unit forwards the data and the target vertex ID to the network-on-chip interface, and the network-on-chip routes the data to the corresponding processing unit according to the target vertex ID to perform the vertex update.
If the vertex received by the scheduler is not being scheduled for the first time, the scheduler feeds the vertex directly into the low-priority buffer. In the processing stage, the scheduler sequentially schedules the active vertices of the low-priority buffer and reads the memory address offset of the first low-priority edge from the high-speed internal memory according to the vertex ID. The scheduler collects the low-priority-edge memory address offsets generated by the 16 processing units within each Tile and sends the address packets to the combiner to access the low-priority edges stored in the external memory.
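A simplified sketch of the scheduler's two-buffer policy (first-time vertices receive a workload ID and go to the high-priority buffer; revisited vertices go to the low-priority buffer) is shown below; the deque-based FIFOs and the counter used for workload IDs are illustrative assumptions.

```python
from collections import deque

class Scheduler:
    """Simplified model of the two-buffer scheduling policy for one iteration round."""

    def __init__(self):
        self.high_buffer = deque()    # FIFO of first-time vertex tasks
        self.low_buffer = deque()     # FIFO of revisited vertex tasks
        self.workload_ids = {}        # vertex ID -> fixed global workload ID (cleared each round)
        self.next_workload_id = 0

    def enqueue(self, vertex_id):
        if vertex_id not in self.workload_ids:       # first schedule in this round
            self.workload_ids[vertex_id] = self.next_workload_id
            self.next_workload_id += 1
            self.high_buffer.append(vertex_id)
        else:                                        # already scheduled before
            self.low_buffer.append(vertex_id)

    def next_high_priority_vertex(self):
        """Pop the oldest pending high-priority vertex task (FIFO order)."""
        return self.high_buffer.popleft() if self.high_buffer else None

    def next_low_priority_vertex(self):
        return self.low_buffer.popleft() if self.low_buffer else None
```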
Step 6, when the combiner receives access requests to the external memory, it performs merging and request reordering at Tile granularity.
Specifically, the combiner first selects the request with the smallest memory address among the 16 processing units of each Tile. Taking the minimum address plus one cache-line size as a threshold, it traverses all request addresses in the current Tile and merges any request whose address is below the threshold with the minimum-address request. In this way, the combiner merges all requests with addresses below the threshold and sends only sequential access requests to the external memory.
After merging, the combiner maps the read data so that the low-priority edge data read from the external memory can be correctly sent back to the corresponding processing units.
Specifically, the combiner checks each edge it reads to determine whether the edge belongs to the Tile being processed; if so, it assigns a one-bit state value set to 1, otherwise the state value is set to 0. Furthermore, since each Tile contains only 16 processing units, only 4 bits per Tile are needed to identify the target processing unit.
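The mapping of read-back edges to processing units can be illustrated with the following sketch (the merging itself follows the address-threshold rule already sketched after the combiner description in the Disclosure). The dict-based edge records with 'tile' and 'pe' fields are assumptions for illustration, not the actual hardware format.

```python
PES_PER_TILE = 16   # as described above: 16 processing units per Tile

def tag_read_edges(read_edges, tile_id):
    """Tag each edge read back from external memory so it can be routed to the
    right processing unit: a 1-bit state marks whether it belongs to the Tile
    being processed, and a 4-bit field identifies one of the 16 PEs."""
    tagged = []
    for edge in read_edges:
        state = 1 if edge["tile"] == tile_id else 0
        pe_index = edge["pe"] % PES_PER_TILE        # fits in 4 bits
        tagged.append({"state": state, "pe": pe_index, "edge": edge})
    return tagged
```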
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.