Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a GPU space-time sharing method for deep learning services, which at least partially solves the problem of poor scheduling efficiency and poor real-time performance in the prior art.
In a first aspect, an embodiment of the present disclosure provides a GPU space-time sharing method for deep learning services, including:
step 1, reading a DNN model list of GPU resources to be allocated in a system initialization state, wherein each element of the list is a triplet, and each element comprises a DNN model, the residual load size and the delay requirement;
step 2, estimating GPU computing resource demand for each DNN model according to the load size of each DNN model in the DNN model list in sequence;
step 3, ordering DNN models in the DNN model list from large to small according to the GPU computing resource demand, and sequentially distributing GPU computing resources for each DNN model from time dimension and space dimension in a fine granularity mode according to the ordering result;
step 4, updating the residual load of each DNN model in the DNN model list according to the GPU resource allocation result, and removing the DNN model with the residual load of 0 from the DNN model list;
and 5, detecting whether the DNN model list is empty, if so, ending the resource allocation flow of the GPU space-time sharing, and if not, executing the step 2.
According to a specific implementation manner of the embodiment of the present disclosure, the step 2 specifically includes:
step 2.1, setting a variable i with an initial value of 1;
step 2.2, reading an ith element in the DNN model list, and acquiring saturated throughput when DNN models corresponding to the ith element are independently deployed on a single GPU and distributed with GPU partitions of different sizes;
step 2.3, if the residual load size of the DNN model corresponding to the ith element is not smaller than the saturation throughput corresponding to the most efficient partition, marking the DNN model corresponding to the ith element as a saturation load, otherwise, finding out the smallest GPU partition with the batch size b meeting a preset formula from 10% to the most efficient partition size, and marking the DNN model corresponding to the ith element as a non-saturation load;
and 2.4, executing i=i+1, judging whether i is greater than the number of DNN models in the DNN model list, if so, ending the estimation flow of the GPU computing resource demand, and if not, executing the step 2.2.
According to a specific implementation manner of the embodiment of the disclosure, the preset formula is
b/r + l(b, p) ≤ slo
wherein slo is the delay SLO of the DNN model corresponding to the ith element, r is its remaining load size, and l(b, p) is the inference delay of the DNN model corresponding to the ith element when it is independently deployed on a GPU partition of size p with batch size b.
According to a specific implementation manner of the embodiment of the present disclosure, the step 3 specifically includes:
step 3.1, setting a variable i with an initial value of 1;
step 3.2, reading the ith element in the sorted DNN model list, if the DNN model corresponding to the ith element is marked as a saturated load, executing the step 3.3, otherwise, executing the step 3.4;
step 3.3, selecting a GPU partition meeting the first preset requirement from the existing GPU partitions of the system, if the GPU partition meeting the first preset requirement exists, calculating the maximum batch size meeting the delay requirement of the DNN model corresponding to the ith element, distributing GPU computing resources for the DNN model corresponding to the ith element in a fine granularity mode, executing step 3.6, otherwise adding a GPU for the system, and then re-executing step 3.3;
step 3.4, selecting a GPU partition meeting a second preset requirement from the existing GPU partitions of the system, if the GPU partition meeting the second preset requirement exists, calculating the maximum batch size meeting the DNN model delay requirement corresponding to the ith element, distributing GPU computing resources for the DNN model corresponding to the ith element in a fine granularity mode, and executing step 3.6, otherwise, executing step 3.5;
step 3.5, selecting a GPU partition meeting a third preset requirement from the existing GPU partitions of the system, if the GPU partition meeting the third preset requirement exists, calculating the maximum batch size meeting the delay requirement of the DNN model corresponding to the ith element, distributing GPU computing resources for the DNN model corresponding to the ith element in a fine granularity mode, executing step 3.6, otherwise adding a GPU for the system, and then re-executing step 3.5;
and 3.6, executing i=i+1, judging whether i is greater than the number of DNN models in the ordered DNN model list, if so, ending the dynamic allocation flow of GPU resources, otherwise, executing the step 3.2.
According to a specific implementation manner of the embodiment of the present disclosure, the first preset requirement includes: the partition has not been allocated a workload and its size is not smaller than the GPU computing resource demand of the DNN model corresponding to the ith element; and after a GPU partition matching the GPU computing resource demand of the DNN model corresponding to the ith element is split from it, the delay requirement of the DNN model corresponding to the ith element can be met without losing throughput of other loads on the same GPU;
the second preset requirement includes: the partition has been allocated a workload and its size is not smaller than the GPU computing resource demand of the DNN model corresponding to the ith element; and if the DNN model corresponding to the ith element is allocated to the GPU partition, the delay requirements of all inference requests can be met;
the third preset requirement includes: the partition has not been allocated a workload and its size is not smaller than the GPU computing resource demand of the DNN model corresponding to the ith element; and after a GPU partition matching the GPU computing resource demand of the DNN model corresponding to the ith element is split from it, the delay requirement of the DNN model corresponding to the ith element can be met without losing throughput of other loads on the same GPU.
The GPU space-time sharing scheme for the deep learning service in the embodiment of the disclosure comprises the following steps: step 1, reading a DNN model list of GPU resources to be allocated in a system initialization state, wherein each element of the list is a triplet, and each element comprises a DNN model, the residual load size and the delay requirement; step 2, estimating GPU computing resource demand for each DNN model according to the load size of each DNN model in the DNN model list in sequence; step 3, ordering DNN models in the DNN model list from large to small according to the GPU computing resource demand, and sequentially distributing GPU computing resources for each DNN model from time dimension and space dimension in a fine granularity mode according to the ordering result; step 4, updating the residual load of each DNN model in the DNN model list according to the GPU resource allocation result, and removing the DNN model with the residual load of 0 from the DNN model list; and 5, detecting whether the DNN model list is empty, if so, ending the resource allocation flow of the GPU space-time sharing, and if not, executing the step 2.
The beneficial effects of the embodiment of the disclosure are that: in this scheme, predicting the performance of DNN inference tasks under diverse parallel conditions provides guidance for allocating GPU resources to multiple model instances; based on diverse factors such as each model instance's request load and request delay SLO, the demand of DNN tasks for GPU resources is estimated, and GPU computing resources are dynamically and flexibly allocated in a fine-grained manner in both the time and space dimensions, maximizing GPU resource utilization on the premise of guaranteeing deep learning service quality.
Detailed Description
Embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present disclosure will become readily apparent to those skilled in the art from the following disclosure, which describes embodiments of the present disclosure by way of specific examples. It will be apparent that the described embodiments are merely some, but not all embodiments of the present disclosure. The disclosure may be embodied or practiced in other different specific embodiments, and details within the subject specification may be modified or changed from various points of view and applications without departing from the spirit of the disclosure. It should be noted that the following embodiments and features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the disclosure by way of illustration, and only the components related to the disclosure are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
With the rapid development of deep learning, deep neural network (Deep Neural Network, abbreviated as DNN) models have been widely used in various fields. Inference tasks based on DNN models often consume a large amount of computing resources, so resource-constrained devices cannot support deep learning applications in a low-latency, high-accuracy manner. For this reason, a large number of cloud service providers deploy DNN inference computation as a service in cloud data centers (referred to as deep learning service systems), so that devices with limited computing resources can also support deep learning applications through the service. To meet users' real-time needs, cloud service providers typically deploy deep learning service systems on clusters of graphics processing units (Graphics Processing Unit, GPU for short). Each time the system receives an inference request from a user, it performs inference computation in the requested DNN model and returns the corresponding prediction result.
Ideally, a deep learning service system should meet key metrics such as low latency, low cost, and high throughput. However, in practical production scenarios, a single DNN inference request often fails to fully utilize expensive GPU computing resources. To improve GPU resource utilization, mainstream deep learning service systems adopt a batching strategy: inference requests sent to the same model are processed in batches, and after the inference results of all requests in a batch are obtained, the result data are returned together. This method can effectively improve GPU resource utilization and overall system throughput, but it prolongs the response delay of individual inference requests. Since online inference tasks of DNN models often have a strict latency SLO (Service Level Objective), i.e., an inference request must receive a response within a certain time frame, the batching strategy increases the risk of violating the inference requests' latency SLOs.
In addition, GPU sharing, that is, deploying multiple DNN models on the same GPU server, is another effective method for improving GPU resource utilization. Existing research efforts mainly employ time-sharing, space-sharing, and space-time-sharing policies to manage and schedule multiple DNN model instances deployed on the same GPU. Given input data, the completion time of a DNN task running independently on a GPU is deterministic. Under a time-sharing policy, the GPU runs only a single DNN task in any scheduling unit, which naturally preserves the predictability of DNN task execution time and simplifies the design of scheduling flows that meet delay SLOs. However, the DNN tasks supported by a time-sharing policy run independently on the GPU in turn, the parallel execution capability of GPU kernels is not fully utilized, and GPU resource utilization remains low. Space-sharing technologies represented by NVIDIA MPS (Multi-Process Service), CUDA (Compute Unified Device Architecture) streams, MIG (Multi-Instance GPU), and the like allow the GPU to run multiple tasks at any time, each task using only part of the GPU's computing resources, which helps improve the GPU resource utilization of deep learning service systems. However, under a space-sharing policy, concurrently running tasks suffer performance losses of varying degrees, which challenges the stability of inference request response times.
The GPU space-time sharing policy combines the characteristics of time sharing and space sharing, so that the GPU can support running multiple tasks at any time while GPU resources are also distributed to different DNN model instances along the time dimension. Compared with a pure time-sharing policy or a pure space-sharing policy, the GPU space-time sharing policy is more conducive to improving GPU resource utilization and reducing the hardware cost of a system. However, current GPU space-time sharing research for deep learning service systems still faces the following challenges:
1) The execution time of a DNN task is affected by performance interference from other tasks executing in parallel. In GPU space-time sharing mode, DNN tasks executed in parallel overlap arbitrarily in the time dimension, which increases the uncertainty of task start and end times; that is, the combination of tasks executing in parallel on the GPU varies greatly over time. Because different parallel task combinations interfere with each other to different degrees, it is difficult for a deep learning service system to predict the execution time of each DNN task in GPU space-time sharing mode, and therefore difficult to guarantee users' real-time requirements.
2) A DNN model's demand for GPU resources changes dynamically with metrics such as the batch size of requests and the requests' delay SLOs. These dynamic factors increase the difficulty of using the space-time sharing mode to raise GPU resource utilization. Current GPU space-time sharing methods for deep learning services cannot dynamically and flexibly partition the GPU space according to the resource requirements of DNN tasks. In addition, users' real-time demands must be balanced against GPU resource utilization, which increases the complexity of GPU resource management and scheduling.
The embodiment of the disclosure provides a GPU space-time sharing method for deep learning service, which can be applied to a GPU resource scheduling process of a cloud service scene.
Referring to fig. 1, a flowchart of a GPU space-time sharing method for deep learning services is provided in an embodiment of the present disclosure. As shown in fig. 1, the method mainly comprises the following steps:
step 1, reading a DNN model list of GPU resources to be allocated in a system initialization state, wherein each element of the list is a triplet, and each element comprises a DNN model, the residual load size and the delay requirement;
In a specific implementation, as shown in fig. 2, the GPU space-time sharing framework for deep learning services is mainly composed of four parts: a performance analysis module, an interference-aware delay predictor, a global scheduler, and a back-end executor. The four parts interact with each other to jointly complete the resource allocation flow of GPU space-time sharing. The global scheduler is responsible for receiving inference requests from users and estimating the GPU resource demand of each DNN model according to the load condition of each DNN model instance, i.e., the number of inference requests per second. The performance analysis module collects performance data of each DNN model under diverse parallel execution conditions and provides the data to the interference-aware delay predictor, which calculates the parallel inference delay of the DNN model. Parallel inference delay refers to the delay when multiple DNN models perform inference computations on the GPU at the same time. As shown in FIG. 3, where a-d represent different DNN model instances, the GPU is divided into three parts, i.e., three GPU partitions. Each partition runs DNN inference computation tasks, so tasks on different partitions can run simultaneously, i.e., the GPU is space-shared; the delay when multiple DNN inference tasks execute on the GPU at the same time is the parallel inference delay. DNN tasks executed in parallel overlap in a random, interleaved manner in the time dimension; that is, the task combinations executing simultaneously on the GPU differ greatly across time periods, and the degree of mutual interference between different task combinations varies considerably. The interference-aware delay predictor establishes a performance prediction model for DNN tasks based on the sampled data and provides guidance for the delay-SLO-aware GPU partitioning and scheduling strategy.
The delay-SLO-aware GPU partitioning and scheduling module allocates GPU computing resources to each DNN model according to each DNN model's GPU resource demand and the information provided by the delay predictor, and sends the guidance information to the back-end executor. The back-end executor places each DNN model on a GPU meeting the requirements according to the guidance information and allocates the corresponding amount of GPU computing resources to the related models. In addition, whenever the global scheduler detects a significant change in the overall load, the resource allocation flow of GPU space-time sharing is re-executed.
The resource allocation flow of GPU space-time sharing is as follows:
The system initialization state comprises one GPU. The list models of DNN models awaiting GPU resource allocation is read; each element of the list is a triplet <model_i, r_i, slo_i>, wherein model_i represents the DNN model instance corresponding to the ith element in the list, r_i is model_i's remaining load size, i.e., the currently monitored number of inference requests sent to model_i per second, and slo_i is model_i's corresponding delay SLO.
Step 2, estimating GPU computing resource demand for each DNN model according to the load size of each DNN model in the DNN model list in sequence;
on the basis of the above embodiment, the step 2 specifically includes:
step 2.1, setting a variable i with an initial value of 1;
step 2.2, reading an ith element in the DNN model list, and acquiring saturated throughput when DNN models corresponding to the ith element are independently deployed on a single GPU and distributed with GPU partitions of different sizes;
step 2.3, if the residual load size of the DNN model corresponding to the ith element is not smaller than the saturation throughput corresponding to the most efficient partition, marking the DNN model corresponding to the ith element as a saturation load, otherwise, finding out the smallest GPU partition with the batch size b meeting a preset formula from 10% to the most efficient partition size, and marking the DNN model corresponding to the ith element as a non-saturation load;
and 2.4, executing i=i+1, judging whether i is greater than the number of DNN models in the DNN model list, if so, ending the estimation flow of the GPU computing resource demand, and if not, executing the step 2.2.
Further, the preset formula is
b/r_i + l(b, p) ≤ slo_i
wherein slo_i is the delay SLO of the DNN model corresponding to the ith element, r_i is its remaining load size, and l(b, p) is the inference delay of the DNN model corresponding to the ith element when it is independently deployed on a GPU partition of size p with batch size b.
In the implementation, the GPU computing resource demand is estimated for each DNN model in the list models according to the load size of the DNN models in sequence, and the method comprises the following specific steps:
2.1, setting a variable i with an initial value of 1 for indexing elements in list models;
2.2, reading the ith element <model_i, r_i, slo_i> in list models, and obtaining model_i's saturated throughput when it is independently deployed on a single GPU and assigned GPU partitions of different sizes. Here, a GPU partition refers to a sub-part of a GPU rather than the whole GPU, and the size of a GPU partition is the percentage of the allocated computing resources relative to the whole GPU's computing resources. GPU partitioning can be realized by technologies such as MPS and MIG; for example, MPS may be utilized to allocate 30% of a GPU's computing resources to the inference tasks of a certain DNN model, i.e., to assign that DNN model a GPU partition of 30% size. To test model_i's saturated throughput on given input data and a given GPU partition, first find the maximum batch size b that satisfies the following formula:
2l(b,p)≤slo
wherein slo is model_i's delay SLO, and l(b, p) is model_i's inference delay when it is independently deployed on a GPU partition of size p with batch size b. After finding the maximum batch size b that satisfies this condition, calculating b/l(b, p) gives model_i's saturated throughput on a GPU partition of size p. By this method, model_i's saturated throughput on GPU partitions of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% size is obtained in turn. The GPU partition delivering the highest saturated throughput per unit of computing resource is model_i's most efficient GPU partition. FIG. 4 is a schematic diagram of obtaining the most efficient GPU partition of the GoogLeNet model, with the abscissa representing the GPU partition size and the ordinate the saturated throughput of the GoogLeNet model measured on an NVIDIA A100 GPU partition of the corresponding size. The saturated throughput of the 20% GPU partition is 2.13 times that of the 10% GPU partition, and the saturated throughput growth rate becomes slower and slower beyond 20%, so the 20% GPU partition is the most efficient GPU partition of the GoogLeNet model.
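The saturated-throughput search described above can be sketched as follows. The latency function is a synthetic stand-in for the profiled delay l(b, p) (a real system measures it on the target GPU), and "most efficient" is read here as the partition with the highest saturated throughput per unit of computing resource:

```python
# Sketch of saturated-throughput profiling (step 2.2). The latency model
# below is hypothetical; a real system measures l(b, p) on the GPU itself.

PARTITION_SIZES = [round(0.1 * k, 1) for k in range(1, 11)]  # 10% .. 100%

def latency(b, p):
    # Hypothetical profile: a fixed overhead plus a compute term that stops
    # improving once the model can no longer use extra resources (p > 40%).
    return 0.002 + 0.001 * b / min(p, 0.4)

def saturated_throughput(p, slo, max_batch=128):
    # Largest batch b with 2*l(b, p) <= slo, then throughput b / l(b, p).
    best = max((b for b in range(1, max_batch + 1)
                if 2 * latency(b, p) <= slo), default=0)
    return best / latency(best, p) if best else 0.0

def most_efficient_partition(slo):
    # Highest saturated throughput per unit of GPU computing resource.
    return max(PARTITION_SIZES, key=lambda p: saturated_throughput(p, slo) / p)
```

With this synthetic profile, throughput stops growing beyond the 40% partition, so 40% comes out as the most efficient size, mirroring the 20% knee observed for GoogLeNet in FIG. 4.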
2.3, if model_i's remaining load size r_i is not smaller than the saturated throughput corresponding to the most efficient partition, the most efficient GPU partition size is taken as model_i's GPU computing resource demand and model_i is marked as a saturated load; otherwise, from 10% up to the most efficient partition size, the smallest GPU partition for which some batch size b satisfies the following formula is found, that partition size is taken as model_i's GPU computing resource demand, and model_i is marked as an unsaturated load:
b/r_i + l(b, p) ≤ slo_i
wherein slo_i is model_i's delay SLO, r_i is model_i's remaining load size, and l(b, p) is model_i's inference delay when it is independently deployed on a GPU partition of size p with batch size b.
2.4, executing i=i+1, judging whether i is greater than the DNN model number in the list models, and if so, ending the estimation flow of the GPU computing resource demand; otherwise, step 2.2 is performed.
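Step 2's classification into saturated and unsaturated loads can be sketched as below. latency() is again a hypothetical stand-in for the profiled delay l(b, p), and the unsaturated test uses the preset formula as reconstructed here, b/r + l(b, p) ≤ slo; the 40% most-efficient size is an assumption for illustration:

```python
# Self-contained sketch of step 2.3 (saturated vs. unsaturated loads).
# latency() is a synthetic stand-in for the profiled inference delay l(b, p).

def latency(b, p):
    return 0.002 + 0.001 * b / p  # hypothetical profile

def saturated_throughput(p, slo, max_batch=128):
    b = max((x for x in range(1, max_batch + 1)
             if 2 * latency(x, p) <= slo), default=0)
    return b / latency(b, p) if b else 0.0

def estimate_demand(r, slo, most_efficient=0.4):
    # Saturated load: the remaining load meets or exceeds what the most
    # efficient partition can serve, so demand that partition itself.
    if r >= saturated_throughput(most_efficient, slo):
        return most_efficient, True
    # Unsaturated load: smallest partition (10% up to the most efficient
    # size) for which some batch b satisfies b/r + l(b, p) <= slo.
    for p in (0.1, 0.2, 0.3, 0.4):
        if any(b / r + latency(b, p) <= slo for b in range(1, 129)):
            return p, False
    return most_efficient, False
```

For example, a heavy load of 500 requests/s would be marked saturated with the most efficient partition as its demand, while a light load of 50 requests/s fits the smallest 10% partition and is marked unsaturated.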
Step 3, ordering DNN models in the DNN model list from large to small according to the GPU computing resource demand, and sequentially distributing GPU computing resources for each DNN model from time dimension and space dimension in a fine granularity mode according to the ordering result;
on the basis of the above embodiment, the step 3 specifically includes:
step 3.1, setting a variable i with an initial value of 1;
step 3.2, reading the ith element in the sorted DNN model list, if the DNN model corresponding to the ith element is marked as a saturated load, executing the step 3.3, otherwise, executing the step 3.4;
step 3.3, selecting a GPU partition meeting the first preset requirement from the existing GPU partitions of the system, if the GPU partition meeting the first preset requirement exists, calculating the maximum batch size meeting the delay requirement of the DNN model corresponding to the ith element, distributing GPU computing resources for the DNN model corresponding to the ith element in a fine granularity mode, executing step 3.6, otherwise adding a GPU for the system, and then re-executing step 3.3;
step 3.4, selecting a GPU partition meeting a second preset requirement from the existing GPU partitions of the system, if the GPU partition meeting the second preset requirement exists, calculating the maximum batch size meeting the DNN model delay requirement corresponding to the ith element, distributing GPU computing resources for the DNN model corresponding to the ith element in a fine granularity mode, and executing step 3.6, otherwise, executing step 3.5;
step 3.5, selecting a GPU partition meeting a third preset requirement from the existing GPU partitions of the system, if the GPU partition meeting the third preset requirement exists, calculating the maximum batch size meeting the delay requirement of the DNN model corresponding to the ith element, distributing GPU computing resources for the DNN model corresponding to the ith element in a fine granularity mode, executing step 3.6, otherwise adding a GPU for the system, and then re-executing step 3.5;
and 3.6, executing i=i+1, judging whether i is greater than the number of DNN models in the ordered DNN model list, if so, ending the dynamic allocation flow of GPU resources, otherwise, executing the step 3.2.
Further, the first preset requirement includes: the partition has not been allocated a workload and its size is not smaller than model_i's GPU computing resource demand; and after a GPU partition matching model_i's GPU computing resource demand is split from it, model_i's delay requirement can be met without losing throughput of other loads on the same GPU;
the second preset requirement includes: the partition has been allocated a workload and its size is not smaller than model_i's GPU computing resource demand; and if model_i is allocated to the GPU partition, the delay requirements of all inference requests can be met;
the third preset requirement includes: the partition has not been allocated a workload and its size is not smaller than model_i's GPU computing resource demand; and after a GPU partition matching model_i's GPU computing resource demand is split from it, model_i's delay requirement can be met without losing throughput of other loads on the same GPU.
In specific implementation, ordering DNN models in the list models from large to small according to the estimated GPU computing resource demand, and sequentially distributing GPU computing resources for each DNN model in a fine granularity mode according to the ordering result, wherein the specific steps are as follows:
3.1, setting a variable i with an initial value of 1, and indexing elements in the ordered list models;
3.2, reading the ith element <model_i, r_i, slo_i> in the sorted list models; if model_i is marked as a saturated load, performing step 3.3; otherwise, performing step 3.4;
3.3, selecting from the system's existing GPU partitions a GPU partition meeting the following requirements: 1) the partition has not been assigned a workload and is not smaller than model_i's GPU computing resource demand; 2) after a GPU partition exactly matching model_i's GPU computing resource demand is split from it, model_i's delay requirement can be met without losing throughput of other loads on the same GPU. If a GPU partition meeting these two requirements exists, calculate the maximum batch size b satisfying 2l(b, p) ≤ slo, wherein slo is model_i's delay SLO, l(b, p) is model_i's parallel inference delay when deployed on a GPU partition of size p with batch size b, and p is model_i's GPU computing resource demand; then set model_i's allocation result for the current round to <l_i, b_i> and perform step 3.6, wherein b_i is the maximum batch size meeting the above requirement and l_i is the parallel inference delay corresponding to b_i. Otherwise, add a GPU to the system and then re-perform step 3.3;
3.4, selecting from the system's existing GPU partitions a GPU partition meeting the following requirements: 1) the partition has been assigned a workload and its size is not smaller than model_i's GPU computing resource demand; 2) if model_i is allocated to the GPU partition, the delay SLOs of all inference requests can still be met. If a GPU partition meeting these two requirements exists, and its scheduling period is d, calculate the maximum batch size b satisfying d + l(b, p) ≤ slo, wherein slo is model_i's delay SLO, l(b, p) is model_i's parallel inference delay when deployed on a GPU partition of size p with batch size b, and p is model_i's GPU computing resource demand; then set model_i's allocation result for the current round to <l_i, b_i> and perform step 3.6, wherein b_i is the maximum batch size meeting the above requirement and l_i is equal to d. Otherwise, perform step 3.5.
3.5, selecting from the system's existing GPU partitions a GPU partition meeting the following requirements: 1) the partition has not been assigned a workload and its size is not smaller than model_i's GPU computing resource demand; 2) after a GPU partition exactly matching model_i's GPU computing resource demand is split from it, model_i's delay requirement can be met without losing throughput of other loads on the same GPU. If a GPU partition meeting these two requirements exists, calculate the maximum batch size b satisfying b/r_i + l(b, p) ≤ slo, wherein slo is model_i's delay SLO, l(b, p) is model_i's parallel inference delay when deployed on a GPU partition of size p with batch size b, p is model_i's GPU computing resource demand, and r_i is model_i's remaining load size; then set model_i's allocation result for the current round to <l_i, b_i> and perform step 3.6, wherein b_i is the maximum batch size meeting the above requirement and l_i is equal to b_i/r_i. Otherwise, add a GPU to the system and then re-perform step 3.5;
3.6, executing i = i + 1, and judging whether i is greater than the number of DNN models in the list models; if so, ending the dynamic allocation flow of GPU resources; otherwise, performing step 3.2.
Step 4, updating the residual load of each DNN model in the DNN model list according to the GPU resource allocation result, and removing the DNN model with the residual load of 0 from the DNN model list;
In specific implementation, the remaining load size of each DNN model in the list models is updated according to the GPU resource allocation result; assuming the allocation result corresponding to model_i is <l_i, b_i>, its remaining load size r_i is updated to r_i − b_i; if the remaining load size r_i corresponding to model_i is 0, model_i is deleted from the list models.
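The update in step 4 can be sketched as follows (the dictionary layout and the field names `name` and `r` are assumptions for illustration):

```python
# Minimal sketch of step 4: subtract the batch b_i scheduled this round
# from each model's remaining load r_i, then drop fully served models.
def update_loads(models, allocations):
    for m in models:
        _l_i, b_i = allocations[m["name"]]   # allocation result <l_i, b_i>
        m["r"] = max(0, m["r"] - b_i)
    return [m for m in models if m["r"] > 0]  # remove models with load 0
```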
And 5, detecting whether the DNN model list is empty, if so, ending the resource allocation flow of the GPU space-time sharing, and if not, executing the step 2.
In specific implementation, whether the DNN model list is empty can be detected in real time; if so, resource scheduling is considered complete and the resource allocation flow of GPU space-time sharing can be ended; if not, workloads still awaiting GPU resources are considered to exist, and the flow returns to step 2 to continue estimating resource demand.
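The overall step-1-to-5 loop can be sketched as follows. The callbacks `estimate`, `allocate`, and `update` stand in for steps 2 to 4 and are assumptions for illustration, not the disclosed implementations:

```python
# Hedged sketch of the outer loop: repeat demand estimation, allocation,
# and load update until the DNN model list becomes empty (step 5).
def schedule(models, estimate, allocate, update):
    rounds = 0
    while models:                           # step 5: stop when list is empty
        demands = estimate(models)          # step 2: resource demand per model
        allocs = allocate(models, demands)  # step 3: time/space allocation
        models = update(models, allocs)     # step 4: update remaining loads
        rounds += 1
    return rounds
```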
According to the GPU space-time sharing method for deep learning services provided by this embodiment, the performance of DNN inference tasks under diversified parallel conditions is predicted, thereby providing guidance for allocating GPU resources to multi-model instances; based on diversified factors such as the request load and the request delay SLO of each model instance, the demand of DNN tasks for GPU resources is estimated, and GPU computing resources are dynamically and flexibly allocated at fine granularity in the time and space dimensions, maximizing GPU resource utilization on the premise of guaranteeing deep learning service quality.
The GPU space-time sharing method for deep learning services provided by the embodiments of the present disclosure can realize dynamic partitioning and scheduling of GPU resources on the premise of guaranteeing the delay SLO requirements of DNN inference tasks; the strategy is efficient and highly scalable, and can minimize the hardware cost of a deep learning service system. The beneficial effects are as follows: 1) by predicting the performance of DNN inference tasks under diversified parallel conditions, the real-time requirement of a deep learning service system in the GPU space-time sharing mode can be guaranteed; 2) by dynamically partitioning and allocating GPU resources at fine granularity in the time and space dimensions based on the resource demands of workloads, the utilization rate of GPU resources is improved, and a more flexible solution is provided for resource management of deep learning service systems.
FIG. 5 is a graph comparing the performance of the present invention with the prior art on a server consisting of 8 NVIDIA A100 GPUs. The system configuration is as follows:
TABLE 1
Gpull is a state-of-the-art GPU space-time sharing scheduling system. Compared with Gpull, the present invention can use fewer GPUs on the premise of meeting the delay SLOs of more than 99% of requests.
It should be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the disclosure are intended to be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.