Disclosure of Invention
The embodiments of the application provide a scheduling method and a scheduling device for a chip system, which enable a user to program a multi-node chip system in a single-chip mode, improving the usability and compatibility of programming while guaranteeing multi-node execution performance.
In order to achieve the above purpose, the following technical scheme is adopted in the embodiment of the application.
According to a first aspect, a scheduling method of a chip system is provided. The chip system comprises a plurality of bare chips, each bare chip being a node. When a processor sends a computing task to the chip system, the processor divides the computing task into a plurality of subtasks and determines the logical resources required by the plurality of subtasks; the processor determines task descriptions of the plurality of subtasks to be executed by a plurality of nodes in the chip system, the plurality of nodes comprising a master node and at least one slave node, where the task description corresponding to each subtask is used for indicating the subtask and the physical resources required to execute the subtask; the processor sends the task descriptions respectively corresponding to the plurality of subtasks to the master node, where the plurality of subtasks comprise the subtask of the master node and the subtasks of the at least one slave node, and each of the subtasks of the at least one slave node corresponds to one slave node; and the master node sends the task description of each of the subtasks of the at least one slave node to the corresponding slave node.
The application can be used in a multi-die chip system. For example, the architecture of the chip system is a multi-NUMA architecture of AI chips. In the multi-NUMA architecture of AI chips, each AI chip can be understood as one NUMA node, and each AI chip can be a die, or a chiplet, integrated into the multi-NUMA architecture chip. Each AI chip may perform a computing task, also referred to as an AI task or an AI computing task.
That is, when executing a computing task, the present application may split the computing task into a plurality of subtasks and determine the logical resources required by the plurality of subtasks. In this way, after the logical resources are mapped to physical resources and the physical resources for executing the plurality of subtasks are determined, the task descriptions of the plurality of subtasks can be obtained by assembling the plurality of subtasks with their physical resources. A master node may then be selected among the plurality of nodes, the task descriptions of the plurality of subtasks are sent to the master node by the processor, and the scheduling and distribution of the plurality of subtasks are performed by the master node. In this way, the master node and the at least one slave node need only execute the distributed subtasks on their own physical resources, and cross-node data access is not required when executing a single operator (a single computing task). A program the user writes for the CPU may therefore use the chip system as if it were a single chip; for example, a CPU program may use the multi-AI-chip NUMA architecture like a single chip. As a result, the user can program in a single-chip mode, the usability and compatibility of CPU programming are improved while the multi-node execution performance of the chip system is guaranteed, and high-performance execution of a single operator in a multi-node scenario is achieved. An operator can be understood as a computing task to be performed.
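For ease of understanding, the following is a minimal, hypothetical Python sketch of the message flow described above: the processor sends all task descriptions to the master node, and the master node keeps its own subtask while forwarding each slave subtask to the corresponding slave node. The class and field names (TaskDescription, Node, MasterNode, physical_resources) are illustrative assumptions, not the claimed implementation.

```python
from dataclasses import dataclass

@dataclass
class TaskDescription:
    subtask_id: int
    node_id: int              # node (die) that must execute the subtask
    physical_resources: dict  # physical resources required to execute it

class Node:
    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.inbox: list[TaskDescription] = []

    def receive(self, desc: TaskDescription) -> None:
        self.inbox.append(desc)

class MasterNode(Node):
    def __init__(self, node_id: int, slaves: dict[int, Node]) -> None:
        super().__init__(node_id)
        self.slaves = slaves

    def receive(self, desc: TaskDescription) -> None:
        # Keep the master's own subtask; forward each slave subtask to its slave node.
        if desc.node_id == self.node_id:
            self.inbox.append(desc)
        else:
            self.slaves[desc.node_id].receive(desc)

slaves = {1: Node(1), 2: Node(2)}
master = MasterNode(0, slaves)
# The processor sends every task description to the master node only.
for d in [TaskDescription(0, 0, {"mem": "1GB"}),
          TaskDescription(1, 1, {"mem": "1GB"}),
          TaskDescription(2, 2, {"mem": "1GB"})]:
    master.receive(d)
print(len(master.inbox), len(slaves[1].inbox), len(slaves[2].inbox))  # 1 1 1
```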
According to a second aspect, a scheduling method of a chip system is provided. The method comprises: when a computing task is sent to the chip system, dividing the computing task into a plurality of subtasks and determining the logical resources required by the plurality of subtasks; determining task descriptions of the plurality of subtasks to be executed by a plurality of nodes in the chip system, the plurality of nodes comprising a master node and at least one slave node, where the task description corresponding to each subtask is used for indicating the subtask and the physical resources required to execute the subtask; and sending the task descriptions respectively corresponding to the plurality of subtasks to the master node, where the plurality of subtasks comprise the subtask of the master node and the subtasks of the at least one slave node, each of the subtasks of the at least one slave node corresponds to one slave node, and the task description of each of the subtasks of the at least one slave node is sent to the corresponding slave node.
The method of the second aspect may be performed by a processor in the device, and the processor may send the split plurality of subtasks to the chip system. The advantages of the second aspect may be seen from the description of the advantages of the first aspect.
In the first and/or second aspect:
In one possible design, dividing the computing task into a plurality of subtasks includes dividing the computing task into a plurality of subtasks according to execution time consumption of a first operator corresponding to each of a plurality of task dividing modes of the computing task, wherein the execution time consumption of the first operator corresponding to the first task dividing mode of the computing task is used for indicating the time consumption of executing one subtask by a single node in the first task dividing mode, and the first task dividing mode is any task dividing mode.
An operator can be understood as a computing task to be performed, for example an AI task. That is, when determining the number of parts into which the AI task is to be split, the splitting principle is to compare the execution time consumption of the first operator under each splitting mode, so as to determine the task splitting mode with shorter time consumption.
In one possible design, splitting the computing task into a plurality of subtasks according to the execution time consumption of the first operators respectively corresponding to the multiple task splitting modes of the computing task comprises: determining the execution time consumption of the first operator corresponding to each of the multiple task splitting modes of the computing task, determining, among the multiple task splitting modes, a target task splitting mode in which the execution time consumption of the first operator meets a condition, and splitting the computing task into a plurality of subtasks by adopting the target task splitting mode.
The method can determine the target task splitting mode meeting the condition by comparing the execution time consumption of the first operator under each task splitting mode against the condition. Therefore, when the plurality of subtasks are obtained by splitting according to the target task splitting mode, the plurality of nodes do not need cross-node data access when executing a single operator (a single computing task, such as a single AI task), which saves task execution time. Thus, for a multi-AI-chip NUMA architecture, the application can use the NUMA architecture of multiple AI chips like a single chip, so that a user can program in a single-chip mode, the usability and compatibility of programming are improved while the execution performance of multiple NUMA nodes is ensured, and high-performance execution of a single operator in a multi-NUMA-node scenario is achieved.
In one possible design, the number of subtasks is N, where N is an integer greater than or equal to 2. The first operator execution time consumption comprises: (1/N) × the total cross-node data copy time of the computing task, (1/N) × the total operator computation time of the computing task, and the scheduling software and hardware time. The condition comprises: (1/N) × the total cross-node data copy time of the computing task + the scheduling software and hardware time < (1/N) × the total operator computation time of the computing task.
Taking the computing task as an AI task and the chip system as a multi-AI-chip NUMA architecture as an example, the total cross-NUMA data copy time of the AI task can be understood as the total time of input, output, read and write among NUMA nodes. The scheduling software and hardware time may include the software time of copying data and the scheduling synchronization time. The total operator computation time of the AI task can be understood as the computation time when a single NUMA node completes the AI task. If the AI task is split into N parts, N subtasks are obtained; under the condition that (1/N) × the total NUMA data copy time of the AI task + the scheduling software and hardware time < (1/N) × the total operator computation time of the AI task, the current operator splitting mode can be considered profitable. When the N subtasks are executed synchronously on N NUMA nodes, the operator performance cost is better.
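As an illustration only, the splitting-benefit condition above can be sketched in Python as follows; the function name and the example time values are assumptions used for explanation, not part of the scheduler implementation.

```python
def split_is_profitable(n: int,
                        total_copy_time: float,     # total cross-node (NUMA) data copy time of the task
                        total_compute_time: float,  # total operator computation time on a single node
                        sched_sw_hw_time: float) -> bool:
    """Return True if splitting the task into n subtasks is expected to pay off.

    Condition: (1/n) * total_copy_time + sched_sw_hw_time < (1/n) * total_compute_time
    """
    return total_copy_time / n + sched_sw_hw_time < total_compute_time / n


# Example with assumed timings: a task whose single-node compute time dominates
# its copy time benefits from being split across 4 NUMA nodes.
print(split_is_profitable(n=4, total_copy_time=8.0,
                          total_compute_time=40.0, sched_sw_hw_time=1.5))  # True
```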
In one possible design, splitting the computing task into a plurality of subtasks includes: determining a second operator execution time when the computing task is executed by a single node, the second operator execution time being the maximum of the data read-write time and the operator computation time when the single node executes the computing task; and if the second operator execution time is greater than or equal to a time consumption threshold, performing the step of splitting the computing task into the plurality of subtasks when the processor sends the computing task to the chip system.
Taking the computing task as an AI task and the chip system as a multi-AI-chip NUMA architecture as an example, when a single NUMA node executes the AI task, data read-write and operator computation are performed simultaneously, and the maximum of the two is taken as the second operator execution time. If the second operator execution time of the AI task on a single NUMA node is long, the AI task can be split and the resulting subtasks executed synchronously by a plurality of NUMA nodes, so that cross-NUMA data access is avoided when the single operator is executed on the plurality of NUMA nodes, and high-performance execution of the single operator in a multi-NUMA-node scenario is achieved.
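A minimal sketch of this single-node pre-check, under the assumption that the read-write time, the computation time and the threshold are already known, might look as follows; the names are illustrative.

```python
def should_split(data_rw_time: float, compute_time: float,
                 time_threshold: float) -> bool:
    """Split the task only if executing it on a single node is slow enough."""
    # Read-write and computation run in parallel on one node, so the second
    # operator execution time is the maximum of the two.
    second_operator_time = max(data_rw_time, compute_time)
    return second_operator_time >= time_threshold


# Example: a task bounded by a 12 ms computation against a 10 ms threshold is split.
print(should_split(data_rw_time=6.0, compute_time=12.0, time_threshold=10.0))  # True
```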
In one possible design, the processor determining the task descriptions for a plurality of nodes in the chip system to execute the plurality of subtasks includes: determining the plurality of nodes to execute the plurality of subtasks, determining the physical resources required by the subtask executed by each node according to the logical resources required by the plurality of subtasks, and determining the task description of the subtask executed by each node according to the plurality of nodes and the physical resources required by the subtask executed by each node. Taking the computing task as an AI task and the chip system as a multi-AI-chip NUMA architecture as an example, determining the plurality of nodes for executing the plurality of subtasks is equivalent to determining which AI chips in the NUMA architecture execute the plurality of subtasks, and the physical resources required by each subtask can be understood as the storage resources, computing power resources, task execution resources and the like used when the corresponding AI chip executes the subtask. The finally determined AI chips and the physical resources of each subtask are encapsulated into the task description of each subtask and issued to the multiple AI chips. The multiple AI chips do not need cross-NUMA-node access, so the time consumed to execute the AI task is small. For the programmer, the plurality of subtasks only need to be sent to the master NUMA node, so the user can program the CPU in a single-chip mode, the usability and compatibility of CPU programming are improved while the multi-node execution performance of the chip system is guaranteed, and high-performance execution of a single operator in a multi-node scenario is achieved.
In one possible design, the task description of the subtask corresponding to the master node is used to instruct the master node to send the task description of each of the subtasks of the at least one slave node to one of the corresponding at least one slave node. In this way, for the processor, when issuing a plurality of subtasks to a plurality of nodes, the main node does not need to be additionally instructed to distribute the subtasks besides the task description of the plurality of subtasks, and the task issuing efficiency is higher.
Or the method further comprises the step that the processor instructs the master node to send the task description of each sub-task of the sub-tasks of the at least one slave node to one of the corresponding at least one slave node. This corresponds to the fact that after the processor sends the task description of each sub-task to the master node, an indication is sent to instruct the master node to send the task description of the sub-task corresponding to the slave node to at least one slave node. In this way, the master node and at least one slave node may perform locally received subtasks, avoiding cross-node access.
In one possible design, before sending task descriptions corresponding to the plurality of subtasks to the master node, the method further comprises selecting a node with the smallest load degree of the plurality of nodes as the master node according to the load information of each node of the plurality of nodes. The load information is used for indicating the load degree of the node.
Taking the computing task as an AI task and the chip system as a multi-AI-chip NUMA architecture as an example, the load of the nodes can be understood as the affinity of the NUMA nodes. For example, the multiple NUMA nodes may be ordered by load from small to large, and the least loaded NUMA node among them may be selected as the master NUMA node. The NUMA task load can be understood as the load on the current task execution resources. Thus, if the NUMA node with the smallest load is used as the master NUMA node to perform task scheduling, that is, to distribute subtasks to the slave NUMA nodes, the load of each NUMA node in the multi-NUMA-architecture chip can be balanced.
In one possible design, the load information of each node includes at least one of: the task load on the node's current task execution resources, the length of the task queue waiting to be executed, and historical load statistics. That is, the load information of a node is affected by its load conditions. For example, for a chip system of the NUMA architecture, when the master NUMA node is determined according to the load information of the multiple NUMA nodes and a NUMA node with a smaller load is selected as the master NUMA node, task scheduling efficiency can be higher and the loads of the NUMA nodes can be balanced.
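For illustration, a hedged Python sketch of master-node election by load is given below; the way the three load signals are weighted into one score is an assumption, since the design only requires that at least one of them be used.

```python
from dataclasses import dataclass

@dataclass
class NodeLoad:
    node_id: int
    task_load: float        # load on the node's current task execution resources
    queue_length: int       # number of tasks waiting to be executed
    historical_load: float  # historical load statistics

    def score(self) -> float:
        # Illustrative combination of the three load signals into one value.
        return self.task_load + 0.1 * self.queue_length + 0.5 * self.historical_load


def elect_master(nodes: list[NodeLoad]) -> NodeLoad:
    """Order the nodes by load from small to large and pick the least loaded one."""
    return min(nodes, key=lambda n: n.score())


nodes = [NodeLoad(0, 0.7, 3, 0.6), NodeLoad(1, 0.2, 1, 0.3), NodeLoad(2, 0.5, 0, 0.4)]
print(elect_master(nodes).node_id)  # 1
```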
In one possible design, the method further includes the slave node synchronizing the execution state of its subtask to the master node, the execution state including execution started, execution ended, or aborted. Thus, for example, for a chip system of the NUMA architecture, when the master NUMA node knows the execution state of each slave NUMA node, it can report possible exceptions, whether a task has started executing, whether a task has finished executing, and the like to the processor at any time, so that the processor can keep track of the execution of the currently issued AI task.
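The state synchronization can be sketched as follows; the enumeration values mirror the states above, while the reporting mechanism itself (a direct method call on the master) is a simplifying assumption.

```python
from enum import Enum, auto

class ExecState(Enum):
    STARTED = auto()   # subtask has begun execution
    FINISHED = auto()  # subtask has ended execution
    ABORTED = auto()   # subtask execution was aborted

class MasterNodeState:
    def __init__(self) -> None:
        self.slave_states: dict[int, ExecState] = {}

    def on_state_report(self, slave_id: int, state: ExecState) -> None:
        """Record the state reported by a slave so exceptions can be escalated."""
        self.slave_states[slave_id] = state
        if state is ExecState.ABORTED:
            # In a real system the master would report the exception to the processor.
            print(f"slave {slave_id} aborted, notify the processor")

master = MasterNodeState()
master.on_state_report(1, ExecState.STARTED)
master.on_state_report(1, ExecState.FINISHED)
```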
According to a third aspect, a scheduling method of a chip system is provided, applied to a master node in the chip system. The method comprises: receiving task descriptions, sent by a processor, respectively corresponding to a plurality of subtasks, where the task description corresponding to each subtask is used for indicating the subtask and the physical resources required to execute the subtask, the plurality of subtasks are obtained by the processor splitting a computing task, the plurality of subtasks comprise the subtask of the master node and the subtasks of at least one slave node, and each of the subtasks of the at least one slave node corresponds to one of the at least one slave node; and sending the task description of each of the subtasks of the at least one slave node to the corresponding one of the at least one slave node.
The advantages of the third aspect can be seen from the description of the advantages of the first aspect.
In one possible design, the task description of the subtask corresponding to the master node instructs the master node to send the task description of each of the subtasks of the at least one slave node to one of the corresponding at least one slave nodes, respectively, and the method further includes receiving an indication of a processor instructing the master node to send the task description of each of the subtasks of the at least one slave node to one of the corresponding at least one slave nodes, respectively.
In one possible design, the execution state of the subtasks of at least one slave node is synchronized to the master node, the execution state including start execution, end of execution, or abort.
In a fourth aspect, a scheduling device is provided, which may be, for example, a processor. The scheduling device comprises a task segmentation unit, a resource mapping unit and a task scheduling unit, wherein the task segmentation unit is used for segmenting a computing task into a plurality of subtasks and determining logic resources required by the subtasks when the computing task is sent to a chip system, the resource mapping unit is used for determining task descriptions of the subtasks executed by a plurality of nodes in the chip system, the plurality of nodes comprise a master node and at least one slave node, the task descriptions corresponding to the subtasks are used for indicating the subtasks and physical resources required by the execution of the subtasks, the task scheduling unit is used for sending task descriptions respectively corresponding to the subtasks to the master node, the subtasks comprise the subtasks of the master node and the subtasks of the at least one slave node, each subtask of the subtasks of the at least one slave node corresponds to one slave node, and the task descriptions of the subtasks of the at least one slave node are respectively sent to the corresponding at least one slave node.
The advantages of the fourth aspect may be seen in the description of the advantages of the first aspect, and are not described here again.
In one possible design, the task segmentation unit is used for segmenting the computing task into a plurality of subtasks according to the execution time consumption of a first operator corresponding to each of a plurality of task segmentation modes of the computing task, wherein the execution time consumption of the first operator corresponding to the first task segmentation mode of the computing task is used for indicating the time consumption of a single node executing one subtask in the first task segmentation mode, and the first task segmentation mode is any task segmentation mode.
In one possible design, the task segmentation unit is used for determining the execution time consumption of a first operator corresponding to each of a plurality of task segmentation modes of the computing task, determining a target task segmentation mode in which the execution time consumption of the first operator meets the condition in the plurality of task segmentation modes, and segmenting the computing task into a plurality of subtasks by adopting the target task segmentation mode.
In one possible design, the number of subtasks is N, where N is an integer greater than or equal to 2. The first operator execution time consumption comprises (1/N) × the total cross-node data copy time of the computing task, (1/N) × the total operator computation time of the computing task, and the scheduling software and hardware time, and the condition comprises (1/N) × the total cross-node data copy time of the computing task + the scheduling software and hardware time < (1/N) × the total operator computation time of the computing task.
In one possible design, the task scheduling unit is further configured to determine a second operator execution time when a single node executes the computing task, where the second operator execution time is the maximum of the data read-write time and the operator computation time when the single node executes the computing task, and if the second operator execution time is greater than or equal to a time consumption threshold, perform the splitting of the computing task into a plurality of subtasks when the processor sends the computing task to the chip of multiple non-uniform memory access nodes.
In one possible design, the resource mapping unit is configured to determine a plurality of nodes for executing a plurality of subtasks, determine physical resources required by the subtasks executed by each node according to logical resources required by the plurality of subtasks, and determine task descriptions of the subtasks executed by each node according to the plurality of nodes and the physical resources required by the subtasks executed by each node.
In one possible design, the task description of the subtask corresponding to the master node is used for instructing the master node to send the task description of each of the subtasks of the at least one slave node to one of the at least one corresponding slave nodes, respectively, or the task scheduling unit is further used for instructing the master node to send the task description of each of the subtasks of the at least one slave node to one of the at least one corresponding slave nodes, respectively.
In one possible design, the task scheduling unit is further configured to select a node with the smallest load degree of the plurality of nodes as a master node according to load information of each of the plurality of nodes, where the load information is used to indicate the load degree of the node.
In one possible design, the load information includes, for each node, at least one of a task load on the current task execution resource of the node, a task queue length waiting to be executed, and historical load statistics.
In a fifth aspect, a chip system is provided. The chip system comprises a plurality of bare chips, each bare chip being a node, and comprises a receiving unit and a processing unit, wherein the receiving unit is used for receiving task descriptions, sent by a processor, respectively corresponding to a plurality of subtasks, the task description corresponding to each subtask is used for indicating the subtask and the physical resources required to execute the subtask, the plurality of subtasks are obtained by the processor splitting a computing task, the plurality of subtasks are executed by a master node and at least one slave node in the chip system, the plurality of subtasks comprise the subtask of the master node and the subtasks of the at least one slave node, each of the subtasks of the at least one slave node corresponds to one of the at least one slave node, and the task description of each of the subtasks of the at least one slave node is respectively sent to the corresponding one of the at least one slave node.
The advantages of the fifth aspect may be seen in the description of the advantages of the first aspect, and are not described here again.
In one possible design, the task description of the subtask corresponding to the master node instructs the master node to send the task description of each of the subtasks of the at least one slave node to one of the corresponding at least one slave nodes, respectively, or the receiving unit is further configured to receive an indication of the processor, the indication of the processor being configured to instruct the master node to send the task description of each of the subtasks of the at least one slave node to one of the corresponding at least one slave nodes, respectively.
In one possible design, the execution state of the subtasks of at least one slave node is synchronized to the master node, the execution state including start execution, end of execution, or abort.
In a sixth aspect, a scheduling system is provided, comprising a processor and a chip system, where the chip system comprises a plurality of bare chips and each bare chip is a node. The processor is used for: when sending a computing task to the chip system, dividing the computing task into a plurality of subtasks and determining the logical resources required by the plurality of subtasks; determining task descriptions of the plurality of subtasks to be executed by a plurality of nodes in the chip system, the plurality of nodes comprising a master node and at least one slave node, where the task description corresponding to each subtask is used for indicating the subtask and the physical resources required to execute the subtask; and sending the task descriptions respectively corresponding to the plurality of subtasks to the master node, where the plurality of subtasks comprise the subtask of the master node and the subtasks of the at least one slave node, and each of the subtasks of the at least one slave node corresponds to one slave node. The master node is used for sending the task description of each of the subtasks of the at least one slave node to the corresponding slave node.
In a seventh aspect, an electronic device is provided that includes a memory and a processor. The memory is coupled to the processor. The memory is for storing computer program code, the computer program code comprising computer instructions. The electronic device may further include a transceiver, and the transceiver is used for receiving data and transmitting data. When the computer instructions are executed by the processor, the electronic device is caused to perform any of the scheduling methods of a chip system provided in the first aspect or its corresponding possible designs.
In an eighth aspect, the present application provides a chip system, which is applied to an electronic device. The system-on-chip includes one or more interface circuits, and one or more processors. The interface circuit and the processor are interconnected by a line, the interface circuit being adapted to receive a signal from the processor, the signal comprising computer instructions stored in the memory. When the processor executes the computer instructions, the electronic device performs the method as in the first and/or second aspects and any of the possible designs of the first and/or second aspects described above.
In a ninth aspect, there is provided a communications device comprising at least one processor, the at least one processor being connected to a memory, the at least one processor being arranged to read and execute a program stored in the memory to cause the device to perform a method as in the above first and/or second aspect and any one of the possible designs of the first and/or second aspect.
In a tenth aspect, there is provided a chip coupled to a memory for reading and executing program instructions stored in the memory to implement the method as in the first and/or second aspects and any one of the possible designs of the first and/or second aspects described above.
In an eleventh aspect, an embodiment of the present application provides a scheduling apparatus, where the apparatus is included in an electronic device, and the apparatus has a function of implementing the foregoing first aspect and/or second aspect and any one of possible designs of the first aspect and/or the second aspect. The functions can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules or units corresponding to the functions described above. Such as a task segmentation module or unit, a resource mapping module or unit, and a task scheduling module or unit.
In a twelfth aspect, embodiments of the present application provide a computer readable storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to perform the method of the first and/or second aspects and any one of the possible designs of the first and/or second aspects described above.
In a thirteenth aspect, embodiments of the present application provide a computer program product which, when run on a computer or processor, causes the computer or processor to carry out the method of the first and/or second aspects and any one of the possible designs of the first and/or second aspects described above.
It is to be understood that any of the above-mentioned electronic devices, scheduling apparatuses, chip systems, computer-readable storage media or computer program products may be applied to the corresponding methods provided above, and thus, the advantages achieved by the above-mentioned electronic devices, scheduling apparatuses, chip systems, computer-readable storage media or computer program products may refer to the advantages in the corresponding methods and are not described herein.
These and other aspects of the application will be more readily apparent from the following description.
Detailed Description
For ease of understanding, descriptions of some of the concepts related to the embodiments of the application are given below by way of example for reference.
Non-uniform memory access (NUMA) architecture is a computer memory design for multiple processors, where the memory access time depends on the memory location of the processor. Under NUMA, a processor accesses local memory faster than non-local memory.
An artificial intelligence (AI) chip, which may also be referred to as an AI accelerator or a computing card, is a module dedicated to accelerating a large number of computing tasks in an AI application (other, non-computing tasks are still handled by the CPU). In a broad sense, chips oriented to AI computing applications may be referred to as AI chips.
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. "And/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; for example, A and/or B may represent three cases: A alone, both A and B, and B alone. In addition, in the description of the embodiments of the present application, "plurality" means two or more.
The terms "first" and "second" are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present embodiment, unless otherwise specified, the meaning of "plurality" is two or more.
Chiplet technology can be understood as integrating and packaging several pre-manufactured dies that perform specific functions by means of an integration technique to form a system chip. These basic dies can be understood as chiplets.
If the chiplet technology is applied to the AI domain, each die can be understood as one AI chip.
AI chips include AI acceleration chips (chips that accelerate AI computation for a specific class of algorithms or scenarios based on a conventional chip architecture), typified by graphics processors (graphics processing unit, GPU), field-programmable gate arrays (FPGA), and application-specific integrated circuits (ASIC). AI chips may also include the neural processing unit (neural processor unit, NPU), as well as more cutting-edge research directions such as brain-like chips, reconfigurable general-purpose AI chips, and the like.
For example, a GPU operates on multiple data with a single instruction in a single instruction multiple data (single instruction multiple data, SIMD) manner, and has a large number of compute units and an extremely long graphics image processing pipeline. Because the GPU contains a large number of transistors that can form various special-purpose circuits and many pipelines, its computing speed is far higher than that of a CPU, and it has strong floating-point operation capability, which can ease the training difficulty of deep learning algorithms and release AI potential; the GPU is therefore widely used in the field of deep learning algorithms. However, GPUs lack complex arithmetic logic units and must be scheduled by a CPU.
An FPGA can lock instructions into the hardware architecture and then run data with a hardware instruction stream; in simple terms, the AI computing architecture is implemented by hardware circuits, and data streams are continuously fed into the system to complete the computation. Unlike a GPU, an FPGA can possess both hardware pipeline parallelism and data parallel processing capabilities, and is adapted to process data streams in a hardware pipeline manner; it is therefore well suited to the AI inference phase, with significant performance or energy-consumption advantages over a CPU and GPU.
The ASIC may implement AI-specific algorithms, requiring custom chips. So-called customization, i.e., an architecture specifically designed for AI algorithms, can help to improve chip performance and power consumption ratio.
NPU can adopt the architecture of data-driven parallel computation, and is good at processing massive multimedia data of video and image types. In some scenarios, the NPU may be used for internet of things (internet of things, ioT) artificial intelligence to accelerate the operation of the neural network, solving the problem of inefficiency in the traditional neural network operation.
In the embodiment of the present application, as shown in FIG. 1, a schematic diagram of a deployment scenario of an AI chip is shown. The AI chip may typically be deployed in an electronic device, and the electronic device may be, for example, a cloud device or a terminal. Depending on the deployment location, AI chips may be classified into cloud AI chips and terminal AI chips. Of course, the application is not limited thereto, and the AI chip can also be deployed in other devices.
The cloud, namely the data center: the training stage of deep learning requires extremely large data volumes and computation volumes that a single processor cannot complete independently, so the training link is usually implemented in the cloud.
Terminals, i.e., smart devices that perform edge calculations, such as cell phones, security cameras, automobiles, smart home devices, various IoT devices, and the like. The number of terminals is huge and the difference in demand is large.
The cloud AI chip is characterized by strong performance, the ability to support a large number of operations at the same time, and the ability to flexibly support different AI applications such as images, voice and video. Based on cloud AI chip technology, various intelligent devices and cloud servers can be connected rapidly, and the connection can be kept as stable as possible. The terminal AI chip is characterized by small size and low power consumption, and its performance need not be particularly strong; usually only one or two AI capabilities need to be supported. Compared with a cloud AI chip, the terminal AI chip needs to be embedded into the device; after being embedded, it can further improve the AI capability of the device, and the device can use the corresponding AI capability without networking, so that AI coverage becomes more comprehensive.
In terms of the dimension of the tasks that AI assumes, the implementation of AI involves two links: training and inference. Therefore, AI chips can be divided into training chips and inference chips according to their tasks. Generally speaking, a training chip is mainly used for constructing a neural network model, and an inference chip is mainly used for performing inference and prediction with the neural network model.
With the increasing demand for the computing power of a single AI chip, using a NUMA architecture in AI chips has become a trend. Since the resource management and scheduling of the CPU NUMA architecture is well established, the resource management and scheduling of CPU NUMA is first described here to help understand how an AI chip uses the NUMA architecture. FIG. 2 is a NUMA architecture diagram of a CPU, which shows 4 CPUs, CPU0 through CPU3; each CPU includes a local memory, and the 4 CPUs correspond to memory A through memory D, respectively. The program executed by the CPU may be a program programmed by the user on the CPU, or may be a user driver (driver package/accelerator package). The user programming the CPU may be referred to as the user of the CPU (a CPU process). The user driver does not require the user to program the CPU; the user driver may, for example, be the driver package of an accelerator card, which can be loaded onto the CPU and run.
The program programmed on the CPU determines how efficiently the computing power and memory of the CPU NUMA architecture are used. For example, a program programmed by a user on a CPU may select one of the CPUs and generate a task execution request, and at the same time apply to that CPU for memory used to receive the task and to store the intermediate data and result data of task execution, so as to avoid the performance degradation caused by remote access as much as possible. If the program programmed on the CPU determines that the required memory is too large for the task to be executed by one CPU, the program may select another CPU, or select two CPUs with an optimal topology (topo) structure, and issue the task execution request and apply for memory on them respectively. That is, programs that program the CPU avoid remote access in order to reduce the loss of computing power and memory efficiency associated with the multi-NUMA architecture. Each CPU can be understood as a CPU NUMA node (node), and the CPU NUMA architecture includes a plurality of CPU NUMA nodes.
For the scheduling mechanism of CPU NUMA, the program that programs the CPU knows that there are multiple CPU NUMA nodes in the CPU NUMA architecture. The order of scheduling and memory application by the program is: scheduling first, that is, the program selects one CPU in the CPU NUMA architecture to run the task to be executed, and then applies for memory from the CPUs. At this time, according to the CPU selected by the scheduling, the CPU with the shortest path to the selected CPU is chosen to apply for memory. The tasks of each CPU process are issued, scheduled and executed on the same CPU NUMA node selected by the scheduling, thereby avoiding scheduling across CPU NUMA nodes. The CPU with the shortest path is understood as the CPU with the shortest path according to CPU NUMA affinity.
Based on the above, the initiative of fully exploiting the computing power and memory under the CPU NUMA architecture is given to the program that programs the CPU, i.e., the CPU user, which increases the complexity of CPU programming. Moreover, a task is executed on one CPU NUMA node, so the requirement that a plurality of NUMA nodes complete a task together cannot be met.
In the NUMA scenario of AI chips, as shown in FIG. 3, a NUMA architecture of AI chips includes 4 NUMA nodes, AI chips 1-4, each AI chip being a NUMA node. Local memory is disposed in each NUMA node and can be accessed by other NUMA nodes, so the memory of each NUMA node can include a remote cache. In order to let the CPU access the multiple AI chips as if it were calling a single chip (one chip), the industry calls the multiple AI chips in a single-chip mode by increasing the cache space of the remote cache between NUMA nodes and improving the access bandwidth between NUMA nodes. However, the essence of this approach is to make the bandwidth and remote cache between multiple NUMA nodes reach the level of uniform memory access (uniform memory access, UMA), which places high requirements on the hardware process of the AI chip and cannot be met when the hardware process is limited.
Specifically, in the scenario where the CPU calls multiple AI chips in a single-chip mode, the scheduling, memory application and task issuing of the NUMA architecture of the AI chips are completed by the user of the AI chips, such as the user driver mentioned above. When the user driver uses the NUMA architecture of the AI chips to execute a task, the user driver applies for memory from the NUMA architecture of the AI chips and issues the task to the AI chips. That is, the user driver of the AI chips invokes a logic device composed of multiple AI chips (equivalent to a single chip containing multiple groups of logic devices for executing multiple tasks, where the multiple AI chips executing the same task form one group of logic devices) and cannot directly invoke the NUMA node of a single AI chip; the user driver first applies for memory on the AI chips, then issues the task to the AI chips, and needs to find the task scheduler according to the AI chip where the memory is located. For example, FIG. 4 is a diagram illustrating the scheduling of a NUMA architecture of AI chips. When CPU0 loads and runs the user driver that schedules the NUMA architecture of the AI chips, each AI chip may be understood as a NUMA node, a local memory is disposed in each AI chip, and the local memory can be accessed by a remote AI chip through remote access. At this time, a plurality of AI chips, for example AI chip 1 and AI chip 2, constitute one logic device. Assuming that when CPU0 applies for memory in advance for a task to be executed, the AI chips involved in the applied memory include AI chip 1 and AI chip 2, then if CPU0 issues the task to AI chip 1, CPU0 needs to find the task schedulers corresponding to AI chip 1 and AI chip 2 respectively, so that NUMA node 1 of AI chip 1 accesses its local memory and also accesses the local memory of NUMA node 2 of AI chip 2 through remote access, and cross-NUMA-node scheduling exists. However, this method has high hardware process requirements on the AI chip, and the hardware process may be limited.
Therefore, the embodiment of the application provides a scheduling method of a chip system, which can be applied to a chip system comprising a plurality of chips (for example, dies). For example, the architecture of the chip system is a multi-NUMA-node architecture of AI chips; through operator-splitting resource management and master-slave (master-slave) scheduling coordination of the AI chips, cross-NUMA data access is avoided when a single operator is executed on multiple NUMA nodes, and high-performance execution of the single operator in a multi-NUMA-node scenario is achieved. An operator can be understood as a computing task; in the multi-NUMA architecture of a chip, an operator can be understood as an AI task to be performed.
It should be noted that the NUMA architecture to which the present application is applied is merely illustrative; the scheme provided by the present application can be applied to various architectures, as long as the architecture is a multi-AI-chip architecture. That is, the architecture of multiple AI chips is here called a NUMA architecture; in subsequent evolution, the architecture of multiple AI chips may have other names, which is not limited by the present application.
In the method embodiment of the present application, as shown in FIG. 5, a schematic diagram of the software modules provided in the embodiment of the present application may include a multi-node memory management and task segmentation module 501, a runtime task distribution module 502, and a chip scheduling module 503.
The multi-node memory management and task segmentation module 501 may be configured to determine the number of subtasks into which a computing task can be split, split the computing task to obtain a plurality of subtasks, and determine the logical resources required by the split subtasks. For example, the logical resources include the computing power resources, storage resources, and task execution resources for logically executing the plurality of subtasks. Each chip in the chip system may be considered a node; that is, a single computing task may be split into multiple subtasks that are completed cooperatively by multiple nodes.
The runtime task distribution module 502 may be used to complete the mapping of logical resources to physical resources. In addition, the module can determine the task description of each node for the computing task according to the affinity principle, and the task descriptions of the nodes can be sent to the plurality of nodes.
The chip scheduling module 503 may be configured to control the master node (master) among the plurality of nodes to distribute the task descriptions (used for indicating the subtasks and physical resources) of the slave (slave) nodes to the at least one slave node, so that the at least one slave node and the master node can cooperate to complete the computing task. For example, in the multi-NUMA architecture of a chip, the chip system includes NUMA nodes 1-4 shown in FIG. 5, and the determined plurality of nodes include a master NUMA node and a plurality of slave NUMA nodes.
From the hardware architecture, the software program of the multi-node memory management and task segmentation module 501 and the runtime task distribution module 502 can run on the CPU that issues the computing task, and the software program of the chip scheduling module 503 can run on the chip that is the master node.
It has been mentioned above that the program running on the CPU may be a program that the user programs on the CPU, or may be a user driver (driver package/acceleration package). The chip system corresponding to the NUMA architecture of the AI chips described above can be understood, for example, as an accelerator card. When an accelerator card is installed on a device, the software program corresponding to the accelerator card may be understood as the user driver of the accelerator card. When the CPU schedules the chip system, the CPU loads and runs the user driver. In the process of running the user driver, the CPU performs the functions of the multi-node memory management and task segmentation module 501 and the runtime task distribution module 502 described above, i.e., performs task splitting and the election of master and slave nodes. The user driver then controls the master node to execute the functions of the chip scheduling module 503 on the chip system according to the elected master node and at least one slave node, i.e., to distribute the subtasks corresponding to the at least one slave node.
It should be appreciated that although the CPU runs the above user driver, the functions of the multi-node memory management and task segmentation module 501 and the runtime task distribution module 502 are not known to the program programmed by the user; in other words, the program programmed by the user is not used to implement the functions of modules 501 and 502. The user driver does not need to feed back the task splitting mode or the election result of the master and slave nodes to the program programmed by the user. The program programmed by the user on the CPU still schedules the chip system as if it were scheduling a single chip (one chip), so the user can program the CPU in a single-chip mode, the usability and compatibility of CPU programming are improved while the multi-node execution performance of the chip system is guaranteed, and high-performance execution of a single operator in a multi-node scenario is achieved.
With the above-described software and hardware architecture, embodiments of the present application are described below.
FIG. 6 is a schematic flow chart of a scheduling method based on a chip system according to an embodiment of the present application. The chip system in the embodiment of the present application is a chipset formed by a plurality of chips; for a CPU, the chip system is one "chip" (one chip), so the manner in which the CPU accesses or schedules the chip system is the same as the manner in which the CPU accesses or schedules a single chip. In the following description, the architecture of the chip system is taken to be a multi-NUMA architecture of AI chips, and the computing task is an AI task, which may also be understood as an AI computing task or an AI work task.
The method comprises the following flow.
601. When the processor sends a computing task to the system-on-chip, the processor splits the computing task into a plurality of subtasks and determines the logical resources required by the plurality of subtasks.
In some embodiments, the processor and the chip system described above may be included in a scheduling apparatus, which may be included in an electronic device.
In some embodiments, before any CPU in the electronic device sends an AI task to the one chip (chip system) composed of multiple AI chips, the CPU, when loading and running the user driver of the chip system, may first select a task splitting mode with optimal performance according to a performance evaluation algorithm/policy, that is, determine the number of parts into which the AI task can be split. In other words, the number of NUMA nodes that execute the plurality of subtasks of the AI task, or the number of AI chips (dies) that execute the AI task, is determined.
In some embodiments, the basis of the performance evaluation algorithm/policy may be operator execution time consumption. Thus, splitting the computing task into a plurality of subtasks may include splitting the computing task into a plurality of subtasks according to the execution time consumption of the first operators respectively corresponding to the multiple task splitting modes of the computing task. The execution time consumption of the first operator corresponding to a first task splitting mode of the computing task is used to indicate the time consumed by a single node to execute one subtask in the first task splitting mode, where the first task splitting mode is any one of the task splitting modes.
That is, the processor may select a task segmentation mode with a better execution time of the first operator according to the execution time of the first operator when the single node executes one subtask in each task segmentation mode, so as to segment the computing task into a plurality of subtasks.
In some embodiments, the processor may divide the computing task into a plurality of subtasks according to the time consumed by executing the first operators corresponding to the plurality of task division modes of the computing task, where the method may include:
Determining the execution time consumption of a first operator corresponding to each task segmentation mode of a computing task, determining a target task segmentation mode in which the execution time consumption of the first operator meets the conditions in a plurality of task segmentation modes, and segmenting the computing task into a plurality of subtasks by adopting the target task segmentation mode.
That is, the application can screen out the target task segmentation mode meeting the conditions from a plurality of task segmentation modes by setting the conditions. And dividing the computing task into a plurality of subtasks according to the screened target task dividing mode.
For example, the processor may attempt to calculate from splitting the calculation task into 2 subtasks, obtain the execution time consumption of the first operator corresponding to each subtask, and determine whether the execution time consumption of the first operator meets the condition. If the condition is met, determining to divide the computing task into 2 subtasks. If the condition is not met, the processor tries to calculate after dividing the calculation task into 3 subtasks, so as to obtain the execution time consumption of the first operator corresponding to each subtask, and judges whether the execution time consumption of the first operator meets the condition. And by analogy, determining a target task segmentation mode meeting the conditions.
In some embodiments, assume that the number of subtasks is N, which is an integer greater than or equal to 2:
the first operator execution time consumption may include:
(1/N) × the total cross-node data copy time of the computing task, (1/N) × the total operator computation time of the computing task, and the scheduling software and hardware time;
the condition comprises: (1/N) × the total cross-node data copy time of the computing task + the scheduling software and hardware time < (1/N) × the total operator computation time of the computing task.
Specifically, in the multi-NUMA architecture of AI chips, if the AI task is split into N subtasks, the above cross-NUMA data copy time can be understood as the total time of input, output, read and write from one NUMA node to the remaining N-1 NUMA nodes. If the N subtasks are executed by N NUMA nodes simultaneously, the data copy time of 1 NUMA node can be (1/N) × the total cross-NUMA data copy time of the AI task.
The above-mentioned scheduling software and hardware time consumption can be understood as the sum of the software time consumption and the hardware scheduling synchronization time consumption of 1 NUMA node for data copying.
The operator computation time is understood to be the time consumed by 1 NUMA node to perform the entire AI task computation. If N subtasks are performed simultaneously by N NUMA nodes, the operator computation time for 1 NUMA node can be (1/N) × the total operator computation time.
If (1/N) × the total cross-NUMA data copy time of the AI task + the scheduling software and hardware time < (1/N) × the total operator computation time of the AI task, the data read-write time required when the AI task is split into N subtasks can be considered small, the operator splitting is theoretically profitable, and the performance of the N NUMA nodes jointly completing the AI task (a single operator) is better.
When N NUMA nodes concurrently execute their corresponding subtasks, the total operator execution time of the N NUMA nodes executing their respective subtasks (including the cross-NUMA data copy time and the theoretical computation time of the operator) can include: (N-1)/N × the total cross-NUMA data copy time of the AI task + the scheduling software and hardware time + (1/N) × the total operator computation time of the AI task. That is, when performing cross-NUMA data copies, the cross-NUMA data copy time of one NUMA node can include the sum of the times of that NUMA node copying data to the other N-1 NUMA nodes.
Of course, the above conditions are merely exemplary, and the present application is not limited to the condition content of the screening target task segmentation method.
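For illustration, the following sketch shows how the processor might try N = 2, 3, ... and keep the first splitting mode whose first operator execution time satisfies the condition; the estimate callback and its example return values are assumptions standing in for the performance evaluation algorithm/policy.

```python
from typing import Callable, Optional

def choose_split_count(estimate: Callable[[int], tuple[float, float, float]],
                       max_nodes: int) -> Optional[int]:
    """Return the smallest N in [2, max_nodes] that satisfies the splitting condition."""
    for n in range(2, max_nodes + 1):
        copy_t, compute_t, sched_t = estimate(n)  # per-mode time estimates for splitting into n parts
        # Condition: (1/N) * copy time + scheduling sw/hw time < (1/N) * compute time
        if copy_t / n + sched_t < compute_t / n:
            return n
    return None  # no profitable splitting mode found; execute on a single node


# Example with assumed constant estimates on a 4-die chip system.
print(choose_split_count(lambda n: (6.0, 20.0, 2.0), max_nodes=4))  # 2
```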
In some embodiments, the logical resources required by the plurality of subtasks obtained by the processor's splitting may be regarded as the logical resources required when executing the computing task on one chip, i.e., a single chip. The logical resources may include logical computing power resources, storage resources, and task execution resources.
Computing power can be understood as the capability to process information data and output a target result. In the NUMA architecture of AI chips of the application, the AI chip is the main carrier of computing power. The logical computing power resource can be understood as the number of NUMA nodes, i.e., the number of AI chips, required to execute the plurality of subtasks after the AI task is split. Assuming that the number of the plurality of subtasks is N, the logical computing power resource correspondingly corresponds to N NUMA nodes.
For example, the AI task is to perform graphics computation, and each AI chip may be die on the GPU, and the logical operator resource corresponds to determining the number of die on the GPU.
The logical storage resource can be understood as the amount of memory space that each subtask requires when it is executed by the corresponding AI chip.
The logical task execution resource can be understood as the number of internal task concurrency streams (streams), where a concurrent stream can be understood as a stream supporting concurrent operations.
A task execution resource can be understood as a task scheduler, which includes a predefined task manager for executing general tasks and a custom task manager for user-defined tasks, or which can be used to organize and define one or more time-consuming tasks from different types of programs.
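Purely for illustration, the three kinds of logical resources could be recorded per computing task roughly as follows; the LogicalResources structure and its field names are assumptions introduced here for the example, not structures defined by the present application.

```python
from dataclasses import dataclass

@dataclass
class LogicalResources:
    """Logical resources required by the subtasks of one computing task."""
    num_nodes: int           # logical computing power: number of nodes (AI chips) needed
    mem_bytes_per_task: int  # logical storage: memory space each subtask needs
    num_streams: int         # logical task execution: number of concurrent streams

def logical_resources_for(num_subtasks: int, task_mem_bytes: int, streams: int) -> LogicalResources:
    # One node per subtask: N subtasks map to N NUMA nodes.
    return LogicalResources(num_nodes=num_subtasks,
                            mem_bytes_per_task=task_mem_bytes // num_subtasks,
                            num_streams=streams)
```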
In some embodiments, when the processor performs task segmentation on the computing task, the segmentation may be performed according to the execution capability of the nodes in the chip system, so that the execution times of the multiple subtasks on the selected nodes are kept as balanced as possible. Therefore, when selecting a task segmentation mode, the time consumed by a single node to execute one subtask after segmentation can be compared, and the target task segmentation mode can be selected accordingly.
Step 601 may be executed by the multi-node memory management and task segmentation module 501 in the above-mentioned software architecture and, in the hardware architecture, by the CPU that issues the computing task.
602. The processor determines task descriptions of a plurality of subtasks to be executed by a plurality of nodes in the chip system, wherein the plurality of nodes comprise a master node and at least one slave node, and the task descriptions corresponding to the subtasks are used for indicating the subtasks and physical resources required for executing the subtasks.
That is, before performing subtask scheduling, the processor needs to map the logic resources required by the subtasks into the physical resources of the nodes, and obtain task descriptions corresponding to the subtasks respectively. The plurality of subtasks include a subtask of the master node and a subtask of at least one slave node, each of the subtasks of the at least one slave node corresponding to one of the at least one slave node.
Step 601 only splits the computing task into multiple subtasks and determines the logical resources required to execute the resulting subtasks on a single chip; the mapping and conversion from the logical resources of the single chip to physical resources still needs to be completed. That is, the physical computing power resources, storage resources, and task execution resources need to be determined.
In some embodiments, the processor may map and convert logical resources to physical resources according to an affinity principle and assemble the task descriptions of the multiple subtasks on the corresponding nodes. The task description of each node is used to indicate the subtask distributed to that node and the physical resources for executing the subtask.
In some embodiments, the processor may first determine a plurality of nodes to perform a plurality of subtasks and determine the physical resources required for each subtask performed by each node based on the logical resources required for the plurality of subtasks. The task descriptions of the subtasks performed by each node are determined based on the plurality of nodes and the physical resources required for the subtasks performed by each node.
A mapping and conversion scheme from logical resources to physical resources is shown in fig. 7. Assume that, in the multi-NUMA architecture of AI chips, the AI task is split into N parts, resulting in N subtasks. Step 602 corresponds to mapping the logical resources of a single chip (the logical resources required by the single chip to execute the AI task before it is split) to physical resources on N NUMA nodes (NUMA node 1 to NUMA node N).
The mapping of the logical computing power resources to the physical computing power resources is equivalent to determining the identities of N NUMA nodes corresponding to the N subtasks and the acceleration computing resources on each NUMA node. That is, it is determined which of the N subtasks are to be executed on which NUMA nodes in particular, and which acceleration computing resources of each NUMA node are occupied.
In some embodiments, there may be multiple ways to map from the computing power resources among the logical resources to physical NUMA nodes. For example, assuming that the number of subtasks determined in step 601 is 4, step 602 needs to determine on which 4 NUMA nodes the 4 subtasks are to be executed and which acceleration computing resources of those 4 NUMA nodes are occupied. For example, among the plurality of NUMA nodes of the NUMA architecture, the 4 AI chips whose overall system load is optimal, or which meet the load requirement with the smallest system load, may be used as the 4 selected NUMA nodes, and the node identifiers of these 4 NUMA nodes are acquired. By way of example, the possible future load can be estimated from the historical task load of each NUMA node on the single chip, a weighted balance calculation can be performed together with the current load of each NUMA node to obtain a load weight for each NUMA node, and 4 NUMA nodes whose loads meet the requirement can be selected according to the load weights.
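As an illustrative sketch of such a load-weighted selection, with the weighting of historical and current load being an assumption chosen only for the example:

```python
def select_nodes(load_history, load_now, n, w_hist=0.3, w_now=0.7):
    """Pick the n NUMA nodes with the smallest weighted load.

    load_history -- dict: node id -> historical task load (used to estimate future load)
    load_now     -- dict: node id -> current load
    The weights w_hist / w_now are illustrative values only.
    """
    weight = {node: w_hist * load_history[node] + w_now * load_now[node]
              for node in load_now}
    return sorted(weight, key=weight.get)[:n]

# Example: select 4 candidate NUMA nodes out of 8.
# selected = select_nodes(history, current, n=4)
```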
The physical acceleration computing resources to be determined on each NUMA node differ for different AI tasks.
By way of example, assuming that the AI task is a neural network computation task, each NUMA node is a die on an NPU, and the acceleration computing resources on each NUMA node may include AICore, AICPU, digital video pre-processing (DVPP), and the like.
AICore can be understood as the core tensor computation unit, which carries the convolution computation of matrix tensors; for example, each AICore may provide 16 × 16 half-precision floating-point operations per cycle, or a vector unit (vectorUnit) providing 256 half-precision floating-point operations per cycle, and the like.
AICPU can be understood as a physical computing unit.
DVPP integrates the functions of vision pre-processing (vision pre-processing core, VPC), a JPEG encoder (JPEGE), a JPEG decoder (JPEGD), a portable network graphics decoder (PNGD), video decoding (VDEC), video encoding (VENC), and the like.
By way of further example, assuming that the AI task is a graphics computation task, each NUMA node is a die on a GPU, and the acceleration computing resources on each NUMA node may include streaming multiprocessors (SMs) or streaming processors (SPs). In some examples, several SPs, together with some other units, make up an SM. One SM can be regarded as the basic unit of GPU computation scheduling. That is, for one NUMA node integrated on the GPU, the node can include multiple SMs, which together can support concurrent execution of up to several hundred thousand threads. In some examples, one SM may be provided with 64 cores for computation.
Mapping logical storage resources to physical storage resources can be understood as determining, on the corresponding NUMA node, the memory address range occupied by the memory space required for each subtask. For example, in some embodiments, determining the memory address range occupied on each NUMA node amounts, in the underlying implementation, to determining an address range of high bandwidth memory (HBM).
For mapping logical task execution resources to physical task execution resources, it can be understood as determining a hardware task queue corresponding to the number of task concurrency streams (streams) for each subtask on a single NUMA node.
In some embodiments, after determining the physical resources of the plurality of subtasks on the corresponding NUMA nodes, the task descriptions of the NUMA nodes corresponding to the plurality of subtasks may be determined according to the plurality of subtasks and the physical resources corresponding to each subtask, and the task descriptions of the plurality of NUMA nodes may be sent to the chip scheduling module 503.
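For illustration, a task description could be assembled from a subtask and its mapped physical resources roughly as follows; the TaskDescription fields are assumptions introduced for the example and do not prescribe an actual format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TaskDescription:
    node_id: int                # identifier of the NUMA node that will execute the subtask
    subtask_id: int             # which subtask of the split computing task
    accel_units: List[str]      # acceleration computing resources occupied on the node
    mem_range: Tuple[int, int]  # [start, end) physical memory address range on the node
    hw_queue_ids: List[int]     # hardware task queues backing the concurrent streams

def assemble_descriptions(subtasks, nodes, mem_ranges, accel_map, queue_map):
    """Pair each subtask with the physical resources mapped for it on its node."""
    return [TaskDescription(node_id=node,
                            subtask_id=sub,
                            accel_units=accel_map[node],
                            mem_range=mem_ranges[node],
                            hw_queue_ids=queue_map[node])
            for sub, node in zip(subtasks, nodes)]
```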
Step 602 may be performed by the runtime task distribution module 502 in the software architecture and, in the hardware architecture, by the CPU that issues the computing task.
603. The processor sends the task descriptions corresponding to each of the plurality of subtasks to the master node.
The plurality of subtasks include a subtask of the master node and a subtask of at least one slave node, each of the subtasks of the at least one slave node corresponding to one of the at least one slave node.
For example, after the task descriptions of the plurality of nodes corresponding to the plurality of subtasks are obtained, from the perspective of the hardware architecture, the CPU running the user driver of the chip system may send the task descriptions of the plurality of nodes to the master node among the plurality of nodes.
Therefore, before the CPU issues the task descriptions of the plurality of nodes to the master node, the user driver of the chip system run by the CPU needs to select the master node from the plurality of nodes, and the CPU subsequently interacts with the master node. The nodes other than the master node among the plurality of nodes are the at least one slave node. The at least one slave node interacts with the master node and is not perceived by the CPU when the CPU schedules the plurality of subtasks.
In some embodiments, before the processor sends the task descriptions respectively corresponding to the plurality of subtasks to the master node, the method may further include selecting, by the processor, a node with the smallest load degree among the plurality of nodes as the master node according to the load information of each of the plurality of nodes. The load information is used for indicating the load degree of the node.
The process of determining the master node may be performed by the runtime task distribution module 502 in the software architecture and, in the hardware architecture, by the CPU that issues the computing task.
In some embodiments, the load information includes at least one of a task load on the current task execution resource of the node, a task queue length waiting to be executed, and historical load statistics.
For example, the CPU may determine the load degree of each NUMA node according to the task load on the current task execution resources of each NUMA node, the length of its queue of tasks waiting to be executed, and its historical load statistics, together with the corresponding weights, and select the NUMA node with the smallest weighted value as the NUMA node with the smallest load, that is, the master NUMA node.
The present application does not limit the implementation of electing the master NUMA node from among the plurality of NUMA nodes.
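One illustrative way of electing the master node from the candidate nodes is to weight the three load indicators mentioned above; the indicator weights below are assumptions chosen only for the example.

```python
def elect_master(nodes, cur_load, queue_len, hist_load, w=(0.5, 0.3, 0.2)):
    """Return the node with the smallest weighted load among the candidate nodes.

    cur_load  -- task load on the current task execution resources of each node
    queue_len -- length of each node's queue of tasks waiting to be executed
    hist_load -- historical load statistics of each node
    The weights w are illustrative only.
    """
    def score(n):
        return w[0] * cur_load[n] + w[1] * queue_len[n] + w[2] * hist_load[n]
    return min(nodes, key=score)
```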
After the processor determines the master NUMA node among the plurality of NUMA nodes, the CPU can send the task descriptions of the plurality of NUMA nodes to the master NUMA node. The task description corresponding to each NUMA node is used to indicate the subtask to be executed and the physical resources occupied when executing the subtask.
604. The master node sends the task description of each of the sub-tasks of the at least one slave node to one of the corresponding at least one slave node, respectively.
By "corresponding" is understood here that at least one slave node corresponds one-to-one to the task description of the sub-task of at least one slave node, and one slave node corresponds one to the task description of the sub-task of one slave node.
In some embodiments, the task description of the subtask corresponding to the master node is used to instruct the master node to send the task description of each of the subtasks of the at least one slave node to the corresponding one of the at least one slave node. In this way, signaling interaction resources between the processor and the chip system can be saved.
Alternatively, the method further comprises the processor instructing the master node to send the task description of each of the subtasks of the at least one slave node to the corresponding one of the at least one slave node. That is, in addition to the task descriptions of the plurality of subtasks sent to the master node, the processor can send an instruction to the master node instructing it to distribute the subtasks to the at least one slave node. In this way, when the task description of each node is generated, the task description of the master node does not need to be further modified, and the task description generation is simpler.
In some embodiments, in the multi-NUMA architecture of AI chips, when the master NUMA node receives the task descriptions corresponding to the NUMA nodes, the master NUMA node distributes the task description of one subtask to each slave NUMA node and at the same time keeps the task description of one subtask on the master NUMA node itself. The master NUMA node and the slave NUMA nodes then execute their respective subtasks simultaneously. No cross-NUMA data access is required between the master NUMA node and the slave NUMA nodes, or between slave NUMA nodes; each NUMA node only needs to execute its own subtask.
Fig. 8 is a schematic diagram of multiple NUMA nodes jointly executing multiple subtasks of an AI task according to an embodiment of the present application. Assuming that the AI task is split into 4 subtasks, after task assembly is completed and the task descriptions of the 4 subtasks on the corresponding NUMA nodes are obtained, the runtime task distribution module 502 first elects a master NUMA node and then sends the task descriptions corresponding to the 4 NUMA nodes (the master NUMA node, slave NUMA node 1, slave NUMA node 2, and slave NUMA node 3) to the master NUMA node, that is, issues the tasks to the master NUMA node.
For the chip scheduling module 503 in the master NUMA node, after receiving the task descriptions from the runtime task distribution module 502, the subtasks corresponding to the slave NUMA nodes can be distributed to slave NUMA node 1, slave NUMA node 2, and slave NUMA node 3 through a scheduling queue element (SQE), and one subtask is retained on the master NUMA node. In some embodiments, the master NUMA node may distribute the subtasks to the slave NUMA nodes randomly or according to a predetermined rule, which is not limited by the present application.
For example, when there are multiple slave NUMA nodes, the master NUMA node can send the task descriptions corresponding to the slave NUMA nodes to the multiple slave NUMA nodes at the same time, so that the subtask distribution efficiency is high. Alternatively, the master NUMA node can send the task descriptions corresponding to the slave NUMA nodes to the multiple slave NUMA nodes in turn.
Each of the plurality of NUMA nodes may include a pre-read cache (prefetch buffer) module and an accelerator task execution module. For the master NUMA node, the pre-read cache module is operable to receive the task descriptions of the plurality of subtasks and to distribute one subtask to the pre-read cache module of each slave NUMA node. For each slave NUMA node, the pre-read cache module is operable to receive the subtask sent by the pre-read cache module of the master NUMA node. For the master NUMA node, the subtask to be executed can be cached in the hardware task queue of the master NUMA node, that is, the subtask obtained from the pre-read cache module of the master NUMA node is queued and then sent to the accelerator task execution module of the master NUMA node, which executes it. For a slave NUMA node, similarly to the master NUMA node, its subtask to be executed may be cached in the hardware task queue of the slave NUMA node, that is, the subtask obtained from the pre-read cache module of that NUMA node is queued and then sent to the accelerator task execution module of the slave NUMA node, which executes it.
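The distribution and execution flow described above can be illustrated, in a highly simplified form, by the following sketch; Python queues stand in for the pre-read cache modules and hardware task queues, and the class and function names are assumptions introduced only for the example.

```python
import queue
import threading

class NumaNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.prefetch = queue.Queue()  # stands in for the pre-read cache module
        self.hw_queue = queue.Queue()  # stands in for the hardware task queue

    def run(self, results):
        # Move the subtask from the pre-read cache into the hardware task queue,
        # then let the "accelerator task execution module" process it.
        desc = self.prefetch.get()
        self.hw_queue.put(desc)
        results[self.node_id] = f"executed subtask {self.hw_queue.get()['subtask_id']}"

def distribute_and_run(master, slaves, descriptions):
    """descriptions: dict node_id -> {'subtask_id': ...}; the master keeps its own
    task description and forwards the others to the slave nodes."""
    master.prefetch.put(descriptions[master.node_id])
    for s in slaves:
        s.prefetch.put(descriptions[s.node_id])  # distribution from the master's pre-read cache
    results = {}
    threads = [threading.Thread(target=n.run, args=(results,)) for n in [master] + slaves]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # all nodes execute their own subtasks concurrently
    return results
```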
In this way, the master NUMA node and each slave NUMA node can execute their distributed subtasks simultaneously, and no cross-NUMA data access is needed between the master NUMA node and the slave NUMA nodes or between slave NUMA nodes, so that high-performance execution of a single operator in a multi-NUMA-node scenario can be achieved.
In some embodiments, the chip scheduling module 503 is further operable to control the slave NUMA node to synchronize the execution state of the subtasks to the master NUMA node, the execution state including start execution, end of execution, and abort.
For example, referring to fig. 8 and taking slave NUMA node 1 as an example, the to-be-executed task module in the slave NUMA node may be further configured to send the execution state of its subtask to the to-be-executed task module in the master NUMA node after learning the execution state of the subtask from the accelerator task execution module. The remaining slave NUMA nodes can send the execution states of their subtasks to the master NUMA node in the same way as slave NUMA node 1. When the master NUMA node knows the execution state of each slave NUMA node, it can report the execution state of the AI task to the CPU in a timely manner, so that the CPU can promptly learn the execution status of the currently issued AI task. The interaction between the CPU and the master NUMA node can be understood as interaction between the user-state software stack and the master NUMA node.
In some embodiments, after the plurality of nodes execute the plurality of subtasks, execution results of the subtasks may be stored in a cache local to the nodes, and the CPU may then actively access the chip system to obtain the task execution results from the cache.
In some embodiments, when multiple CPUs simultaneously have computing tasks to issue to the chip system, if a node in the chip system receives multiple subtasks issued by the multiple CPUs at the same time, the node may execute the subtasks issued by the multiple CPUs concurrently. For example, a node in the chip system may employ a concurrency guarantee mechanism to ensure that, when it receives subtasks issued by multiple CPUs, the multiple subtasks of the multiple CPUs can be executed concurrently. For example, a node may concurrently execute the multiple subtasks issued by the multiple CPUs through multiple streams.
In some embodiments, when multiple subtasks generated by multiple CPUs access memory at the same address, a node closest to that memory address may be selected in the chip system to execute the multiple subtasks of the multiple CPUs, in which case the multiple subtasks of the multiple CPUs are executed serially. Alternatively, in some embodiments, when executing a performance evaluation algorithm, the processor may evaluate the operator execution time of the multiple subtasks of the multiple CPUs when they are executed serially by a single node, compare it with the operator execution time of the multiple subtasks when they are executed concurrently by multiple nodes with cross-node memory access, and select the better-performing manner to execute the multiple subtasks. For example, the manner with the smaller operator execution time may be selected to execute the multiple subtasks generated by the multiple CPUs.
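A minimal sketch of the comparison described above; the two time estimates are assumed to come from a performance evaluation algorithm or cost model and are not computed here.

```python
def choose_execution_mode(serial_time_on_nearest_node: float,
                          concurrent_time_with_cross_node_access: float) -> str:
    """Pick the execution mode with the smaller estimated operator execution time
    for subtasks issued by multiple CPUs that access the same memory address."""
    if serial_time_on_nearest_node <= concurrent_time_with_cross_node_access:
        return "serial on the node closest to the memory address"
    return "concurrent on multiple nodes with cross-node memory access"
```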
In some embodiments, before step 601, the present application may also determine, through the multi-node memory management and task segmentation module 501 (CPU), whether to segment the computing task, that is, whether to execute the computing task on multiple nodes.
Accordingly, fig. 9 is a schematic flowchart of a scheduling method of a chip system according to an embodiment of the present application. Before step 601, the following procedure may also be included.
901. The processor determines a second operator execution time when the computing task is executed by a single node, the second operator execution time being the maximum of the data read/write time and the operator computation time when the computing task is executed by the single node.
902. If the second operator execution time is greater than or equal to a preset time threshold, the processor splits the computing task into a plurality of subtasks when sending the computing task to the chip system.
That is, the present application may determine whether to split a computing task across nodes according to an operator performance cost, where the cost may be the operator execution time.
In some embodiments, by way of example, the CPU may first evaluate the second operator execution time when the AI task is executed by a single NUMA node, where the second operator execution time = max(data read/write time, operator computation time), i.e., the maximum of the data read/write time and the operator computation time when the AI task is executed by a single NUMA node. This is because the data read/write and the operator computation proceed simultaneously, so the larger of the two can be taken as the time for a single NUMA node to complete the AI task.
If the second operator execution time is greater than or equal to the preset time threshold, it is considered that a single NUMA node would take too long to execute the AI task, so it is determined that the AI task is to be split and the resulting subtasks are to be executed simultaneously by multiple NUMA nodes.
Fig. 10 is a schematic diagram illustrating the determination of whether to split an AI task. If the second operator execution time when a single NUMA node executes the AI task is less than the preset time threshold, the CPU may determine that the number of NUMA nodes is 1, which is equivalent to not splitting the AI task; a single-NUMA-node selection policy may be executed and the AI task may be executed on that single NUMA node. If the second operator execution time is greater than or equal to the preset time threshold, the number of NUMA nodes can be determined to be greater than or equal to 2, which is equivalent to splitting the AI task, and the multi-NUMA-node selection policy can be executed. For the specific multi-NUMA-node selection policy, reference may be made to the description of step 601.
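For illustration, the decision of fig. 10 can be sketched as follows; the threshold and the time inputs are assumed to be supplied by a cost model, and the parameter names are introduced only for the example.

```python
def decide_num_nodes(read_write_time: float,
                     compute_time: float,
                     threshold: float,
                     n_if_split: int = 2) -> int:
    """Return 1 if a single node is fast enough, otherwise a multi-node count (>= 2).

    The second operator execution time is the maximum of the data read/write time
    and the operator computation time, since the two proceed simultaneously.
    """
    second_op_exec_time = max(read_write_time, compute_time)
    if second_op_exec_time < threshold:
        return 1                   # single-NUMA-node selection policy
    return max(2, n_if_split)      # multi-NUMA-node selection policy
```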
In some embodiments, for the implementation of mapping logical storage resources to physical storage resources in step 602, in order to ensure high memory utilization and locality, when determining the NUMA nodes for executing the N subtasks, it may also be determined according to the memory allocation mode on which N NUMA nodes to execute the subtasks and request memory. For example, reference may be made to three modes provided under the CPU NUMA architecture: a default mode, a preferred-new mode, and a preferred-disuse mode.
The default mode can be understood as follows: when memory is requested and a physical memory address range is acquired, memory is requested on the AI chips indicated by a specified NUMA node list, that is, on a specified die list. If the available memory of the NUMA nodes on the specified die list satisfies the logical storage resources required by each subtask, the NUMA nodes for executing the plurality of subtasks are determined on the specified die list and the corresponding memory address ranges are requested. If the remaining memory space of the NUMA nodes on the specified die list is insufficient, that is, smaller than the memory space required by the subtasks as determined in the logical resources, the memory request fails and no new die is searched to request memory from. In this case, the AI task can be considered to have failed to execute.
The preferred-new mode can be understood as keeping old data on old dies and placing new data on new dies. When the N NUMA nodes are determined, the present application can select new NUMA nodes that do not yet store data to execute the N subtasks. For example, in a layout of 4 dies, compared with a diagonally opposite die, a neighboring die with a smaller cost may be selected as a candidate NUMA node among the N NUMA nodes, and memory may be requested on the selected NUMA node.
The preferred-disuse mode can be understood as taking the NUMA node corresponding to an initial die ID, selected according to the memory load balancing principle, as a candidate NUMA node among the N NUMA nodes and requesting memory on it. If the memory of the NUMA node corresponding to the initial die ID is insufficient, the NUMA node corresponding to a die ID with a smaller distance can be selected as a candidate NUMA node among the N NUMA nodes and memory can be requested on it. The distance reflects the speed of accessing the NUMA node: the smaller the distance, the faster the access. A NUMA node with a smaller distance to the CPU can be selected as a candidate NUMA node among the N NUMA nodes, and memory can be requested on it.
Of course, the above three modes are merely possible implementations of determining the N NUMA nodes, and the present application does not limit the manner of determining the physical resources.
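One possible reading of the three modes, sketched below for illustration; the mode names follow the text above, and the fallback-by-distance and load-balancing details are assumptions introduced only for the example.

```python
def allocate(mode, need_bytes, free_bytes, die_list, distance=None, used=None):
    """Return the die IDs to allocate memory on, or None if the request fails.

    free_bytes -- dict: die id -> remaining memory
    distance   -- dict: die id -> access distance (smaller means faster access)
    used       -- set of die ids that already hold data (for preferred-new)
    """
    if mode == "default":
        # Only the specified die list may be used; fail if it cannot hold the data.
        ok = [d for d in die_list if free_bytes[d] >= need_bytes]
        return ok or None
    if mode == "preferred-new":
        # Prefer dies that do not yet hold data ("new data on new dies").
        fresh = [d for d in die_list
                 if d not in (used or set()) and free_bytes[d] >= need_bytes]
        return fresh or None
    if mode == "preferred-disuse":
        # Start from a die chosen by memory load balancing; fall back to the
        # die with the smallest distance if its memory is insufficient.
        start = max(die_list, key=lambda d: free_bytes[d])
        if free_bytes[start] >= need_bytes:
            return [start]
        rest = sorted((d for d in die_list if d != start), key=lambda d: distance[d])
        return [d for d in rest if free_bytes[d] >= need_bytes][:1] or None
    raise ValueError(mode)
```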
Therefore, in the chip system, when a computing task is executed, the computing task can be split into a plurality of subtasks and the logical resources required by the subtasks can be determined. In this way, when the logical resources are mapped to physical resources and the physical resources of the plurality of nodes for executing the plurality of subtasks are determined, a master node and at least one slave node can be selected in the chip system, and the scheduling and distribution of the plurality of subtasks can be performed by the master node. The master node and each slave node then need only execute the subtasks distributed to them, and no cross-node data access is required when a single operator (a single computing task) is executed. For example, for a NUMA architecture of multiple AI chips, a program written by a user on the CPU can use the multi-AI-chip NUMA architecture like a single chip; that is, the user can program in a single-chip manner without perceiving the multiple nodes, the usability and compatibility of CPU programming are improved while the execution performance of the multiple NUMA nodes is guaranteed, and high-performance execution of a single operator in a multi-NUMA-node scenario is achieved.
Moreover, compared with the prior art in which data access is performed across NUMA nodes, in the present application the NUMA nodes access each other only for state synchronization during subtask execution, and the number of cross-NUMA accesses is small. Moreover, the amount of data involved in cross-NUMA data access in the prior art is larger than the amount of data involved in state synchronization in the present application. When state synchronization is performed between NUMA nodes in the present application, for example, only 1 bit of data is needed to indicate the execution state of a subtask, whereas the amount of data accessed across NUMA nodes in the prior art is often tens of bits to hundreds of bits or more.
Moreover, in the prior art, the data addresses of the intermediate results of an AI task executed on one or more NUMA nodes may be discontinuous, so that the CPU accesses data across NUMA nodes more frequently and a single scattered access by the CPU takes longer. In the present application, after each NUMA node executes its received subtask and stores the data, the execution result of the subtask is stored only on that node and the data addresses are continuous, so that in a single scattered access the CPU can obtain the data from one node and the access takes less time. Furthermore, state synchronization is performed only between the master NUMA node and the slave NUMA nodes, so the problem of frequent cross-NUMA data access can be avoided.
It will be appreciated that in order to achieve the above-described functionality, the electronic device comprises corresponding hardware and/or software modules that perform the respective functionality. The present application can be implemented in hardware or a combination of hardware and computer software, in conjunction with the example algorithm steps described in connection with the embodiments disclosed herein. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application in conjunction with the embodiments, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The present embodiment may divide the functional modules of the electronic device according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated modules described above may be implemented in hardware. It should be noted that, in this embodiment, the division of the modules is schematic, only one logic function is divided, and another division manner may be implemented in actual implementation.
In the case of dividing functional modules corresponding to respective functions, fig. 11 shows a schematic diagram of one possible composition of the electronic device 110 involved in the above embodiments. As shown in fig. 11, the electronic device 110 may include a task segmentation unit 1101, a resource mapping unit 1102, and a task scheduling unit 1103. The task segmentation unit 1101 may have the same functions as the multi-NUMA memory management and task segmentation module 501, the resource mapping unit 1102 may have the same functions as the runtime task distribution module 502, and the task scheduling unit 1103 may have the same functions as the chip scheduling module 503. The functions of the task segmentation unit 1101 and the resource mapping unit 1102 may be performed by a CPU, and the functions of the task scheduling unit 1103 may be performed by the master NUMA node.
Wherein the task segmentation unit 1101 may be configured to support the electronic device 110 to perform the above-described steps 601, 901, 902, etc., and/or other processes for the techniques described herein.
The resource mapping unit 1102 may be used to support the electronic device 110 to perform step 602, etc., described above, and/or other processes for the techniques described herein.
The task scheduling unit 1103 may be used to support the electronic device 110 to perform the above-described step 603, etc., and/or other processes for the techniques described herein.
It should be noted that, all relevant contents of each step related to the above method embodiment may be cited to the functional description of the corresponding functional module, which is not described herein.
The electronic device 110 provided in this embodiment is configured to execute the scheduling method of the chip system, so that the same effect as the implementation method can be achieved.
In the case of an integrated unit, the electronic device 110 may include a processing module, a storage module, and a communication module. The processing module may be configured to control and manage the actions of the electronic device 110, for example, to support the electronic device 110 in performing the steps performed by the task segmentation unit 1101, the resource mapping unit 1102, and the task scheduling unit 1103. The storage module may be used to support the electronic device 110 in storing program code, data, and the like. The communication module may be used to support communication of the electronic device 110 with other devices, for example, with wireless access devices.
The processing module may be a processor or a controller, which may implement or perform the various exemplary logical blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination that performs computing functions, for example, a combination including one or more microprocessors, or a combination of a digital signal processor (DSP) and a microprocessor, and the like. The storage module may be a memory. The communication module may be a radio frequency circuit, a Bluetooth chip, a Wi-Fi chip, or another device that interacts with other electronic devices.
In one embodiment, when the processing module is a processor, the storage module is a memory, and the communication module is a transceiver, the electronic device related to this embodiment may be the electronic device 120 having the structure shown in fig. 12.
The embodiment of the application also provides electronic equipment which comprises one or more processors and one or more memories. The one or more memories are coupled to the one or more processors, the one or more memories being configured to store computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the related method steps described above to implement the scheduling method of the chip system in the above-described embodiments.
Embodiments of the present application also provide a computer storage medium having stored therein computer instructions which, when executed on an electronic device, cause the electronic device to perform the above-described related method steps to implement the method for scheduling a chip system in the above-described embodiments.
Embodiments of the present application also provide a computer program product, which when run on a computer, causes the computer to perform the above-mentioned related steps to implement the scheduling method of the chip system executed by the electronic device in the above-mentioned embodiments.
In addition, the embodiment of the application also provides a device which can be a chip, a component or a module, and the device can comprise a processor and a memory which are connected, wherein the memory is used for storing computer execution instructions, and when the device runs, the processor can execute the computer execution instructions stored in the memory so that the chip executes the scheduling method of the chip system executed by the electronic equipment in the method embodiments.
The electronic device, the computer storage medium, the computer program product, or the chip provided in this embodiment are used to execute the corresponding methods provided above, so that the beneficial effects thereof can be referred to the beneficial effects in the corresponding methods provided above, and will not be described herein.
It will be appreciated by those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the method described in the embodiments of the present application. The storage medium includes a U disk, a removable hard disk, a Read Only Memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.