
CN119917293A - A service matching method, electronic device, program product and storage medium - Google Patents


Info

Publication number
CN119917293A
CN119917293A
Authority
CN
China
Prior art keywords
service
request
container group
inference
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510408497.9A
Other languages
Chinese (zh)
Inventor
王文潇
陈培
王德奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd filed Critical Inspur Beijing Electronic Information Industry Co Ltd


Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract


The present application provides a service matching method, an electronic device, a program product and a storage medium, relating to the field of service load balancing. The method comprises: receiving an inference request sent to a model inference service; determining the request length of the inference request, and matching a target container group for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service, the model inference service being provided by at least two container groups; and issuing the inference request to the target container group, so that the target container group provides the model inference service according to the inference request. Because a suitable target container group is matched for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service, the performance cost that the request length imposes on the model inference service can be taken into account, and a better load balancing effect can be achieved.

Description

Service matching method, electronic device, program product and storage medium
Technical Field
The present application relates to the field of service load balancing, and in particular, to a service matching method, an electronic device, a program product, and a storage medium.
Background
With the continuous development of artificial intelligence technology, pre-trained language models are widely applied. To support application deployment of a pre-trained language model, the model may be deployed in multiple container groups (Pods) using a container orchestration platform, with the container groups jointly providing a model inference service externally. However, the load balancing provided by the container orchestration platform for the container groups serving model inference is poor: when massive numbers of inference requests are processed, load imbalance easily arises among the container groups, affecting service quality.
Disclosure of Invention
The application aims to provide a service matching method, an electronic device, a program product and a storage medium, which can match a suitable target container group for an inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service, and can take into account the performance cost that the request length imposes on the model inference service, thereby achieving a better load balancing effect.
In order to solve the above technical problems, the present application provides a service matching method, including:
receiving an inference request sent to a model inference service;
determining the request length of the reasoning request, and matching a target container group for the reasoning request according to the request length and the hardware resource quantity corresponding to the container group for providing model reasoning service;
the inference request is issued to the target container group, so that the target container group provides the model inference service according to the inference request.
Optionally, matching the target container group for the reasoning request according to the request length and the hardware resource amount corresponding to the container group for providing the model reasoning service, including:
Judging whether the request length is larger than a preset length;
if the request length is greater than the preset length, determining the predicted resource consumption corresponding to the request length;
And matching the target container group in the container groups meeting the predicted resource consumption according to the hardware resource remaining amount corresponding to the container groups.
Optionally, matching the target container group among container groups satisfying the predicted resource consumption amount includes:
Taking the container group meeting the predicted resource consumption as a candidate container group;
Setting selection weights for the candidate container groups according to the hardware resource residual amounts of the candidate container groups;
the target container set is matched for the inference request in the candidate container set according to the selection weights.
Optionally, the method further comprises:
searching a historical reasoning request belonging to the same dialogue with the reasoning request according to dialogue identification information in the reasoning request;
determining a predicted resource consumption corresponding to the request length includes:
And determining the predicted resource consumption according to the request length of the reasoning request and the request length of the historical reasoning request.
Optionally, after determining whether the request length is greater than the preset length, the method further includes:
if the request length is not greater than the preset length, the container group corresponding to the client is taken as a target container group according to the client to which the inference request belongs, or the container group with the minimum request processing number is taken as the target container group according to the current request processing number of the container group.
Optionally, determining the request length of the inference request includes:
Reading the request length from the content length field of the reasoning request;
or determining the message length of the request message containing the reasoning request, and taking the message length as the request length.
Optionally, the model reasoning service includes a service component including an external service component, an internal service component, and a request buffer;
receiving an inference request sent to a model inference service, comprising:
controlling an external service component to receive an inference request;
Judging whether the container groups are in an overall slow running state according to the overall performance value corresponding to the container groups;
if the container group is in the overall slow running state, controlling the external service component to issue an inference request to the request buffer, and controlling the internal service component to acquire the inference request from the request buffer;
issuing the reasoning request to the target set of containers, comprising:
and issuing the inference request in the external service component, or the inference request in the internal service component, to the target container group.
Optionally, the model reasoning service further comprises a service grid;
determining the request length of the reasoning request, and matching a target container group for the reasoning request according to the request length and the hardware resource amount corresponding to the container group for providing model reasoning service, wherein the method comprises the following steps:
If the container group is not in the overall slow running state, the control service grid executes the steps of determining the request length of the reasoning request in the external service component, and matching the target container group for the reasoning request according to the request length and the hardware resource quantity corresponding to the container group for providing model reasoning service;
And if the container group is in the overall slow running state, the control service grid executes the steps of determining the request length of the reasoning request in the internal service component, and matching the target container group for the reasoning request according to the request length and the hardware resource quantity corresponding to the container group for providing model reasoning service.
Optionally, the method further comprises:
If the container group is not in the overall slow running state, adding the domain name of the external service component into the load balancing configuration of the service grid, controlling the service grid to execute the steps of determining the request length of the reasoning request in the external service component according to the load balancing configuration, and matching the target container group for the reasoning request according to the request length and the hardware resource quantity corresponding to the container group providing the model reasoning service;
And if the container group is in the overall slow running state, adding the domain name of the internal service component into the load balancing configuration of the service grid, and controlling the service grid to execute, according to the load balancing configuration, the steps of determining the request length of the inference request in the internal service component and matching the target container group for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service.
Optionally, before receiving the reasoning request sent to the model reasoning service, the method further comprises:
receiving a model reasoning service creation request, and determining user-defined resources required by the model reasoning service according to the model reasoning service creation request;
The control resource controller analyzes the self-defined resources into container native resources to create a container group, an external service component and an internal service component corresponding to the model reasoning service, and marks version information for the container group;
the control resource controller creates a service grid for the model reasoning service and adds a load balancing configuration for the service grid.
Optionally, the method further comprises:
receiving a model reasoning service update request, and determining the latest custom resources required by the model reasoning service according to the model reasoning service update request;
the control resource controller converts the latest customized resource into a container original resource to obtain a new version container group, an external service component and an internal service component of the model reasoning service, and marks new version information for the new version container group;
The control resource controller creates a new version of service grid for the model reasoning service, and adds load balancing configuration for the new version of service grid;
Finding and deleting the container groups, external service components, and internal service components of the old version of the model inference service.
Optionally, adding a load balancing configuration to the service grid includes:
and the control resource controller reads the global load balancing configuration from the preset configuration file and adds the global load balancing configuration to the service grid.
Optionally, the method further comprises:
the control strategy controller receives a load balancing configuration updating request, updates a preset configuration file by using the load balancing configuration updating request, and updates the load balancing configuration in the service grid of each model reasoning service by using the preset configuration file.
The present application also provides an electronic device including:
A memory for storing a computer program;
and the processor is used for realizing the service matching method when executing the computer program.
The application also provides a computer program product comprising a computer program or instructions which when executed by a processor implement the service matching method described above.
The application also provides a nonvolatile computer readable storage medium, wherein the nonvolatile computer readable storage medium stores computer executable instructions, and when the computer executable instructions are loaded and executed by a processor, the service matching method is realized.
The application has the advantage that, when receiving an inference request sent to the model inference service, it can determine the request length of the inference request and match a target container group for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service, the model inference service being provided by at least two container groups. This is because the length of an inference request is closely related to the model inference service's consumption of hardware resources: when the text a user inputs through the inference request is long, the model inference service must consume a large amount of hardware resources to analyze the input text, and it also tends to generate a long output text, which in turn requires a large amount of hardware resources to store. Therefore, the application can match a suitable target container group for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service, thereby achieving a better load balancing effect. The application further provides an electronic device, a computer program product and a computer-readable storage medium, which have the same beneficial effects.
Drawings
For a clearer description of embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
FIG. 1 is a flowchart of a service matching method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an inference service provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of another reasoning service provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an update of an inference service version according to an embodiment of the present application;
FIG. 5 is an update schematic diagram of a global load balancing configuration according to an embodiment of the present application;
Fig. 6 is a block diagram of a service matching device according to an embodiment of the present application;
fig. 7 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.
It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The present application will be further described in detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.
With the continuous development of artificial intelligence technology, pre-trained language models are widely applied. A pre-trained language model is a neural network model for natural language processing that has been trained in advance on a large amount of text data, and can be used for language dialogue, document summarization, article expansion, and the like. To support application deployment of a pre-trained language model, the model may be deployed in multiple container groups (Pods) using a container orchestration platform, with the container groups jointly providing a model inference service externally. The container orchestration platform may create container groups (Pods) on device nodes (Nodes), create containers (Containers) within the container groups, and deploy the model inference service in the containers. The device node then provides the container group with underlying hardware computing power, such as processors, memory, and graphics cards. For the same model inference service, the container orchestration platform may create container groups on different device nodes, so that the service is supported by multiple device nodes, whose hardware configurations may differ.
In the related art, although the container orchestration platform can provide a load balancing mechanism for multiple container groups, the load balancing effect obtained by the model inference service is poor, so load imbalance easily occurs among the container groups, affecting service quality. This is because a common load balancing mechanism mainly considers the number of requests routed to each container group, not their length, whereas, unlike an ordinary network service, the performance of a model inference service is affected by both the number and the length of the requests it receives. When a user submits a longer request, the model inference service must consume a large amount of hardware resources and processing time for input analysis and output generation; when the same container group simultaneously processes many longer inference requests, its hardware resources are rapidly exhausted and load imbalance occurs.
In view of this, the present application can provide a service matching method for improving the load balancing effect of a container group in a model reasoning service scenario, which can match a proper target container group for a reasoning request according to a request length and the hardware resource amount corresponding to the container group providing the model reasoning service, and can consider the performance loss of the request length to the model reasoning service, thereby achieving a better load balancing effect.
The service matching method provided by the application is described below. It should be noted that the method is performed by a container orchestration platform. The embodiment does not limit the hardware on which the container orchestration platform is deployed, which can be set according to actual application requirements; for example, it can be deployed on a personal computer or a server. In particular, the container orchestration platform may be deployed in a server cluster, using multiple server devices as device nodes on which container groups can be deployed, so as to improve overall processing performance.
For easy understanding, please refer to fig. 1, fig. 1 is a flowchart of a service matching method according to an embodiment of the present application, where the method may include:
S100, receiving an inference request sent to the model inference service.
In the present embodiment, the inference request refers to a request inputted by the user to request the model inference service to perform model inference. The request may contain input data for the user, which may be in the form of text, files, etc. It should be noted that the request length of the inference request is affected by the text length or the file size, i.e. the longer the text entered by the user or the larger the file volume entered by the user, the longer the request length of the inference request.
S200, determining the request length of the reasoning request, and matching the target container group for the reasoning request according to the request length and the hardware resource quantity corresponding to the container group for providing the model reasoning service, wherein the model reasoning service is provided by at least two container groups.
In this embodiment, the model inference service is provided by at least two container groups, which may be deployed on the same device node or on different device nodes, and different device nodes may have different hardware configurations. Considering that the model reasoning service has different hardware resource consumption and time consumption when processing reasoning requests with different lengths, particularly when processing longer reasoning requests, the model reasoning service needs to consume a large amount of hardware resources to analyze user input, tends to generate longer output text, needs to consume a large amount of hardware resources to generate and store the output text, and needs to consume longer time to process. Therefore, in order to achieve a better load balancing effect, the embodiment can determine the request length of the reasoning request, and match the target container group for the reasoning request in the container groups according to the request length and the hardware resource amount corresponding to the container group providing the model reasoning service, so as to ensure that the hardware resource amount of the target container group can match the hardware resource consumption amount of the reasoning request and ensure that the target container group can rapidly process the reasoning request.
The amount of hardware resources may be the native amount of hardware resources of the container group or the remaining amount of hardware resources of the container group. The present embodiment does not limit the specific type of hardware resource, which may be an amount of memory, an amount of video memory, a processor occupancy rate, or the like.
In particular, the present embodiment can distinguish between longer and shorter reasoning requests, considering that model reasoning services typically consume a large amount of hardware resources and generate longer processing time when dealing with longer reasoning requests, while consuming less hardware resources and allowing fast processing when dealing with shorter user inputs. First, the embodiment can distinguish a longer reasoning request from a shorter reasoning request by judging whether the request length of the reasoning request is greater than a preset length. If the request length is greater than the preset length, the embodiment needs to determine the predicted resource consumption corresponding to the request length. Wherein the predicted resource consumption may be determined based on a historical resource consumption consumed by the model inference service in the past to process the inference request for the request length, e.g., taking the most frequently occurring historical resource consumption as the predicted resource consumption. Then, the present embodiment may determine the remaining amount of hardware resources corresponding to each container group, and determine whether the remaining amount of hardware resources can satisfy the predicted resource consumption amount, so as to select a target container group from the container groups satisfying the demand.
Based on this, matching the target container group for the inference request according to the request length and the hardware resource amount corresponding to the container group providing the model inference service may include:
And step 11, judging whether the request length is larger than a preset length.
And step 12, if the request length is greater than the preset length, determining the predicted resource consumption corresponding to the request length.
And step 13, matching the target container group in the container groups meeting the predicted resource consumption according to the hardware resource remaining amount corresponding to the container groups.
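Steps 11 to 13 above can be illustrated with a minimal Python sketch. This is not from the patent itself: the threshold value, the memory unit, and the linear cost model are all illustrative assumptions; a real system would estimate consumption from historical data as the description states.

```python
from dataclasses import dataclass

PRESET_LENGTH = 1000  # hypothetical threshold (step 11), in characters


@dataclass
class ContainerGroup:
    name: str
    remaining_memory_mb: int  # remaining hardware resources of the group


def predict_consumption(request_length: int) -> int:
    # Placeholder cost model (step 12): assume resource use grows linearly
    # with request length; a real system would look up historical consumption
    # for requests of similar length.
    return request_length // 10


def eligible_groups(request_length: int,
                    groups: list[ContainerGroup]) -> list[ContainerGroup]:
    # Step 13: a long request is matched only against container groups whose
    # remaining resources cover its predicted consumption.
    if request_length <= PRESET_LENGTH:
        return groups  # short requests fall through to the simpler policy
    need = predict_consumption(request_length)
    return [g for g in groups if g.remaining_memory_mb >= need]
```

A usage example: with groups holding 50 MB and 500 MB of headroom, a 2000-character request (predicted cost 200) is eligible only for the second group, while a 100-character request may go to either.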
It should be noted that, given the large amount of context information that is generated during the user's dialogue with the inference model, the inference model needs to refer to this context information when processing the current inference request, and the longer context information also affects the performance of the model inference service. Therefore, the length of the context information may also be considered when determining the predicted resource consumption amount. Therefore, the embodiment can search the historical reasoning request which belongs to the same dialogue with the current reasoning request, and estimate the length of the context information according to the historical reasoning request. That is, in determining the predicted resource consumption amount, the present embodiment may consider the request length of the history inference request belonging to the same dialogue as the current inference request, in addition to the request length of the current inference request. Therefore, when the predicted resource consumption is determined, the embodiment can be closer to the actual working mode of the reasoning model, and further a better load balancing effect can be achieved.
Based on this, the method may further include:
step 21, searching the historical reasoning request belonging to the same dialogue with the reasoning request according to the dialogue identification information in the reasoning request.
Determining the predicted resource consumption corresponding to the request length may include:
and step 31, determining the predicted resource consumption according to the request length of the reasoning request and the request length of the historical reasoning request.
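Steps 21 and 31 can be sketched as follows. The lookup table keyed by dialogue identifier and the linear per-character cost factor are illustrative assumptions; the patent only requires that historical request lengths from the same dialogue contribute to the estimate.

```python
def predicted_consumption(current_len: int,
                          history: dict[str, list[int]],
                          dialogue_id: str,
                          per_char_cost: float = 0.1) -> float:
    # Context from earlier requests in the same dialogue (found via the
    # dialogue identification information, step 21) must also be processed
    # by the model, so it is added to the current request length before
    # estimating consumption (step 31).
    context_len = sum(history.get(dialogue_id, []))
    return (current_len + context_len) * per_char_cost
```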
Further, it is contemplated that there may be a plurality of container groups satisfying the predicted resource consumption amount, and the remaining amounts of hardware resources corresponding to the plurality of container groups may be different, and thus the corresponding performances may be different. On the basis of ensuring allocation fairness, in order to allocate container groups with more hardware resource remaining amounts as possible, the embodiment may use container groups meeting the predicted resource consumption amount as candidate container groups, and set selection weights for each candidate container group according to the hardware resource remaining amounts of each candidate container group. The probability that the candidate container group is selected as the target container group is also in positive correlation with the selection weight, namely, the more the selection weight is, the easier the candidate container group is selected as the target container group. In this way, the embodiment can improve the probability of selecting the container group with larger hardware resource residual quantity, thereby achieving better load balancing effect.
Based on this, matching the target container group among the container groups satisfying the predicted resource consumption amount may include:
And step 41, taking the container group meeting the predicted resource consumption as a candidate container group.
And 42, setting selection weights for the candidate container groups according to the hardware resource residual amounts of the candidate container groups, wherein the selection weights and the hardware resource residual amounts are in positive correlation.
Step 43, matching the target container group for the reasoning request in the candidate container group according to the selection weight.
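Steps 41 to 43 amount to weighted random selection. A minimal sketch follows; using the raw remaining-resource figure as the weight is an assumption, since the patent only requires that the weight be positively correlated with the remaining amount of hardware resources.

```python
import random


def select_target(candidates: list[tuple[str, int]], rng=random) -> str:
    # Each candidate is (group_name, remaining_resources). The selection
    # weight is proportional to the remaining resources, so groups with more
    # headroom are chosen more often, while every candidate keeps a nonzero
    # chance of selection (steps 41-43).
    names = [name for name, _ in candidates]
    weights = [remaining for _, remaining in candidates]
    return rng.choices(names, weights=weights, k=1)[0]
```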
It can be appreciated that the selection weights can be dynamically adjusted according to the real-time variation of the remaining amount of hardware resources. For example, one possible selection weight configuration is as follows:

- match:
  - headers:
      content-length:  # apply the weights below when the request length is at least 1000 characters
  route:
  - destination:
      pod: llm-high-resource-pods
    weight: 70  # weight for the container group with more resources remaining
  - destination:
      pod: llm-normal-pods
    weight: 30  # weight for the container group with fewer resources remaining
Further, if the request length is not greater than the preset length, then considering that shorter inference requests generally consume fewer of the model inference service's hardware resources, scheduling can be performed according to the client to which the inference request belongs, or according to the number of requests each container group is currently processing, so as to improve scheduling efficiency. Scheduling by client takes into account that a large amount of context information is generated during a user's dialogue with the inference model, and that this context must be referred to when the current inference request is processed; routing a client's requests to the container group corresponding to that client keeps the context in one place. Scheduling by the current number of requests being processed takes the container group with the smallest number of requests in process as the target container group, so as to balance the request load across the container groups.
Based on this, after determining whether the request length is greater than the preset length, it may further include:
Step 51, if the request length is not greater than the preset length, taking the container group corresponding to the client as the target container group according to the client to which the inference request belongs, or taking the container group with the smallest number of requests in process as the target container group according to the number of requests each container group is currently processing.
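The two short-request strategies of step 51 can be sketched together; the field names and the affinity map are assumptions for illustration:

```python
def schedule_short_request(client_id, pods, client_affinity):
    """Match a target container group for a short inference request (step 51):
    prefer the container group already serving this client, so the dialogue
    context cached there can be reused; otherwise fall back to the group with
    the fewest requests currently in process."""
    if client_id in client_affinity:
        return client_affinity[client_id]             # client stickiness
    target = min(pods, key=lambda p: p["in_flight"])  # least-loaded group
    client_affinity[client_id] = target["name"]       # remember the mapping
    return target["name"]
```

Once a client is mapped, later requests from it stay on the same group even if another group becomes less loaded, which is what preserves the dialogue context.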
Further, this embodiment does not limit how the request length of the inference request is determined; it may be set according to actual application requirements. For example, the request length may be read from the content length field (content-length) of the inference request, or the message length of a request message containing the inference request may be determined and used as the request length.
Based on this, determining the request length of the inference request may include:
Step 61, reading the request length from the content length field of the reasoning request, or determining the message length of the request message containing the reasoning request, and taking the message length as the request length.
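Step 61 amounts to a two-way fallback; a minimal sketch, assuming a lowercase header dictionary and a raw message body:

```python
def request_length(headers, body):
    """Determine the request length (step 61): prefer the content-length
    header field; if it is absent, fall back to the length of the raw
    request message."""
    value = headers.get("content-length")
    if value is not None:
        return int(value)
    return len(body)
```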
S300, issuing the reasoning request to the target container group so that the target container group can provide model reasoning service according to the reasoning request.
In this embodiment, after matching of the target container group is completed, an inference request may be issued to the target container group, so that the target container group provides a model inference service according to the inference request.
Based on this embodiment, upon receiving an inference request sent to the model inference service, the application can determine the request length of the inference request and match a target container group for the request according to the request length and the hardware resource amount corresponding to the container groups providing the model inference service, the model inference service being provided by at least two container groups. The reason is that the length of an inference request is closely related to how much hardware the model inference service consumes: when the text a user submits through the inference request is long, the service must consume a large amount of hardware resources to analyze the input text, and it also tends to generate a long output text, which in turn requires a large amount of hardware resources to store. Therefore, by matching a suitable target container group for the inference request according to the request length and the hardware resource amount corresponding to the container groups providing the model inference service, the application accounts for the performance cost the request length imposes on the model inference service and achieves a better load balancing effect.
Based on the above embodiment, in order to achieve a better load balancing effect and to avoid the model inference service crashing from overload under high concurrency, this embodiment may further refine the ingress path of the inference request. On this basis, the model inference service may include service components, namely an external service component, an internal service component, and a request buffer. Receiving an inference request sent to the model inference service may include:
S101, controlling the external service component to receive the inference request.
In this embodiment, the external service component is the unified entry for inference requests. It provides an external domain name for accessing the model inference service, and a client can send inference requests to the external service component by accessing this external domain name.
S102, judging whether the container group is in an overall slow running state according to the overall performance value corresponding to the container group.
S103, if the container group is in the overall slow running state, controlling the external service component to send the inference request to the request buffer, and controlling the internal service component to acquire the inference request from the request buffer.
Issuing the inference request to the target set of containers may include:
S301, issuing an reasoning request in the external service component or the reasoning request in the internal service component to a target container group.
In steps S102 and S103, to avoid the model inference service crashing from overload under high concurrency, a request buffer and an internal service component may be further introduced. The request buffer temporarily stores excess inference requests and may include a waiting queue for that purpose. The internal service component is a proxy behind the external service component; it stores the communication information of each container group so that it can interface with the container groups directly. The internal service component can dynamically fetch new inference requests from the request buffer according to the processing condition of the container groups and issue them to the container groups for processing. It can be seen that by introducing the request buffer and the internal service component, excess inference requests can be staged, thereby alleviating the high-concurrency pressure on the container groups.
Furthermore, introducing the request buffer and the internal service component lengthens the transmission path of the inference request, which can hurt the processing efficiency of the container groups when concurrency pressure is low. This embodiment therefore judges whether the container groups are in an overall slow running state according to the overall performance value corresponding to the container groups. The overall slow running state indicates that all container groups providing the model inference service are running slowly, with low processing speed and long response times, i.e., a high-concurrency condition. Thus, in steps S102, S103, and S301, if the container groups are judged to be in the overall slow running state, the external service component is controlled to issue the inference request to the request buffer, and the internal service component is controlled to acquire the inference request from the request buffer and issue it to the container group. If the container groups are judged not to be in the overall slow running state, the inference request can be issued directly from the external service component to the container group.
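The two ingress paths of steps S101 to S103 can be sketched as follows; the class and method names are illustrative assumptions:

```python
from collections import deque

class RequestEntry:
    """Sketch of the ingress in steps S101-S103: under normal load the
    external service component dispatches requests directly; in the
    overall slow running state it parks them in the request buffer, from
    which the internal service component later fetches them."""

    def __init__(self):
        self.buffer = deque()   # request buffer (waiting queue)
        self.direct = []        # requests sent straight to container groups

    def receive(self, request, overall_slow):
        if overall_slow:
            self.buffer.append(request)   # S103: stage in the buffer
        else:
            self.direct.append(request)   # normal path: external component only

    def fetch_buffered(self):
        # internal service component pulls the next staged request, if any
        return self.buffer.popleft() if self.buffer else None
```

The buffer decouples the arrival rate from the container groups' processing rate, which is exactly how the staging relieves high-concurrency pressure.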
The present embodiment does not limit the specific overall performance value; it may be, for example, the overall GPU memory occupation of the container groups or the overall request response time. If the overall GPU memory occupation of the container groups is larger than a preset threshold, or the overall request response time is longer than a preset duration, the container groups can be judged to be in the overall slow running state.
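The two example criteria can be combined into one predicate; the thresholds and field names below are assumptions for illustration, not values fixed by the embodiment:

```python
def is_overall_slow(pods, mem_threshold, latency_threshold_ms):
    """Judge the overall slow running state from an overall performance
    value: total GPU memory occupation above a preset threshold, or
    overall (average) request response time above a preset duration."""
    total_mem = sum(p["gpu_mem_used"] for p in pods)
    avg_latency = sum(p["latency_ms"] for p in pods) / len(pods)
    return total_mem > mem_threshold or avg_latency > latency_threshold_ms
```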
Further, to improve the execution of load balancing, this embodiment may introduce a service grid into the model inference service, dedicated to performing the load balancing operation (i.e., step S200). The service grid provides intelligent traffic routing, load balancing, fault recovery, and circuit-breaking mechanisms, which significantly improve the stability and efficiency of the service. For example, through the traffic-splitting function of the service grid, requests can be distributed proportionally to inference service instances of different versions, supporting gray release and A/B testing; through the circuit-breaking and retry mechanisms, faulty instances can be automatically isolated when a back-end service fails and the request can be retried, avoiding a service avalanche. In addition, the service grid provides rich observability tools that monitor traffic, latency, and error rate in real time, helping the operations team quickly locate and resolve problems. It should be noted that because this embodiment adjusts the ingress of the inference request, the request reaches the container group either through the external service component alone or through the external service component, the request buffer, and the internal service component in turn, so the location where the service grid performs load balancing must be adjusted dynamically: if the container groups are not in the overall slow running state, the service grid is controlled to perform load balancing in the external service component; if they are, it is controlled to perform load balancing in the internal service component.
Based on the above, the model reasoning service also comprises a service grid, and the method for determining the request length of the reasoning request and matching the target container group for the reasoning request according to the request length and the hardware resource amount corresponding to the container group for providing the model reasoning service can comprise the following steps:
S201, if the container group is not in the overall slow running state, the control service grid executes the steps of determining the request length of the reasoning request in the external service component, and matching the target container group for the reasoning request according to the request length and the hardware resource quantity corresponding to the container group for providing the model reasoning service.
Specifically, the domain name of the external service component may be added to the load balancing configuration of the service grid, so that the service grid may automatically perform load balancing in the external service component according to the load balancing configuration.
Based on this, step S201 may include:
And step 71, if the container group is not in the overall slow running state, adding the domain name of the external service component into the load balancing configuration of the service grid, controlling the service grid to execute the steps of determining the request length of the reasoning request in the external service component according to the load balancing configuration, and matching the target container group for the reasoning request according to the request length and the hardware resource amount corresponding to the container group for providing the model reasoning service.
For easy understanding, please refer to fig. 2, fig. 2 is a schematic diagram of an inference service provided in an embodiment of the present application. When the container group is not in an overall slow running state, the inference request may be issued directly from the external service component to the container group. At this point, the communication information for each container group may be copied from the internal service component to the external service component such that the external service component directs the service endpoint to the container group. When the load balancing configuration is directed to an external service component, the service grid performs load balancing processing in the external service component.
S202, if the container group is in a general slow running state, the control service grid executes the steps of determining the request length of the reasoning request in the internal service component, and matching the target container group for the reasoning request according to the request length and the hardware resource quantity corresponding to the container group for providing model reasoning service.
Specifically, the domain name of the internal service component may be added to the load balancing configuration of the service grid, so that the service grid may automatically perform load balancing in the internal service component according to the load balancing configuration.
Based on this, step S202 may include:
Step 81, if the container group is in the overall slow running state, adding the domain name of the internal service component into the load balancing configuration of the service grid, and controlling the service grid to execute, in the internal service component according to the load balancing configuration, the steps of determining the request length of the inference request and matching the target container group for the inference request according to the request length and the hardware resource amount corresponding to the container groups providing the model inference service.
It is worth noting that, for longer inference requests, this embodiment has two scenarios for sending a request to the final target container group: (1) load balancing from the external service component directly to the target container group; (2) forwarding from the request buffer to the internal service component, then load balancing through the internal service component to the target container group. Facing the special scenario of longer inference requests, the text-capture logic of the service grid therefore needs to be configured for both the internal service component and the external service component. Both scenarios can then achieve load balancing of longer inference requests.
For ease of understanding, please refer to fig. 3, fig. 3 is a schematic diagram of another reasoning service provided by an embodiment of the present application, which illustrates the connection relationship and data transfer relationship of the external service component, the request buffer, the internal service component, and the container group. When the container group is in an overall slow running state, the service endpoint of the external service component points to a request buffer, from which the request is forwarded to the internal service component. When the load balancing configuration points to the internal service component, the service grid performs load balancing processing in the internal service component.
Based on the above embodiments, a detailed description will be given below of a method for creating and updating a container group, an external service component, an internal service component, and a service grid. In one possible scenario, before receiving the reasoning request sent to the model reasoning service, it may further comprise:
Step 91, receiving a model inference service creation request, and determining the custom resources required by the model inference service according to the creation request.
Step 92, the control resource controller analyzes the customized resource into a container native resource to create a container group, an external service component and an internal service component corresponding to the model reasoning service, and marks version information for the container group.
In this embodiment, to facilitate creation of the model inference service, software resources and hardware resources required by the inference service may be abstracted into custom resources, and a resource controller may be set, and the custom resources may be converted into container native resources, such as container groups and service components, by using the resource controller. In this way, the embodiment can conveniently create the container group and the service component according to the required user-defined resources, so that the convenience of creating the model reasoning service can be improved.
Further, for convenience of version management, after the creation of the container group is completed, the container group may be marked with version information indicating the version of the container group (model inference service).
Step 93, controlling the resource controller to create a service grid for the model inference service and adding a load balancing configuration to the service grid.
In this step, after the creation of the container group and the service component is completed, the resource controller may be further controlled to create a service grid for the model inference service, and create a load balancing configuration for the service grid, so as to provide a load balancing function for the model inference service.
In one possible case, the method may further comprise:
Step 1001, receiving a model reasoning service update request, and determining the latest custom resources required by the model reasoning service according to the model reasoning service update request;
Step 1002, controlling a resource controller to convert the latest customized resource into a container original resource, obtaining a new version container group, an external service component and an internal service component of a model reasoning service, and marking new version information for the new version container group;
Step 1003, controlling a resource controller to create a new version of service grid for the model reasoning service, and adding load balancing configuration for the new version of service grid;
Step 1004, finding the old-version container group, external service component, and internal service component of the model inference service and deleting them.
In this embodiment, when the resources of the model inference service are updated, the resource controller is controlled to re-create the container group and the service components according to the latest custom resource requirements and to create a service grid for the re-created container group and service components. The old-version container group, external service component, and internal service component corresponding to the model inference service then need to be located and deleted. For ease of understanding, please refer to fig. 4, which is a schematic diagram of an inference service version update according to an embodiment of the present application. It can be seen that after the new version of the model inference service is created, the old version can be deleted. With this update mode, the embodiment achieves a reliable upgrade of the inference service.
Therefore, the embodiment realizes the full life cycle automatic management from creation, update to deletion of the reasoning service through the self-defined resource logic and the matched resource controller, greatly simplifies the operation and maintenance flow and reduces the labor cost.
Furthermore, in order to improve the management efficiency of the load balancing configuration, in this embodiment, a configuration file is set in the container arrangement platform, and the global load balancing configuration is saved by using the configuration file. When the service grid is created, the resource controller may be controlled to read the global load balancing configuration from the configuration file and set up for the service grid.
Based on this, adding a load balancing configuration to the service grid may include:
Step 1101, the resource controller is controlled to read the global load balancing configuration from the preset configuration file, and the global load balancing configuration is added to the service grid.
Specifically, the global load balancing configuration may be saved in a ConfigMap file.
In addition, when the global load balancing configuration is updated, in order to timely apply the update to the service grid of each model reasoning service, a policy controller can be further configured in the embodiment to monitor and receive the load balancing configuration update request. When the load balancing configuration update request is received, the policy controller may update the preset configuration file with the load balancing configuration update request, and update the load balancing configuration in the service grid of each model reasoning service with the preset configuration file.
Based on this, the method may further include:
Step 1201, controlling the policy controller to receive a load balancing configuration update request, updating the preset configuration file with the update request, and updating the load balancing configuration in the service grid of each model inference service with the preset configuration file.
For convenience of understanding, please refer to fig. 5, fig. 5 is a schematic diagram illustrating an update of a global load balancing configuration according to an embodiment of the present application. When the policy controller detects that the global load balancing configuration is updated, the policy controller may globally update the load balancing configuration of each inference service, for example, synchronize the latest load balancing configuration to each model inference service in a broadcast manner.
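The broadcast-style update of step 1201 can be sketched as follows; the configuration key, the dictionary shapes, and the function name are illustrative assumptions:

```python
def broadcast_lb_update(new_config, config_file, service_grids):
    """Sketch of step 1201: the policy controller persists the updated
    global load-balancing configuration to the preset configuration file,
    then synchronizes it to the service grid of every model inference
    service (a broadcast-style update)."""
    config_file["global-lb-config"] = new_config   # update the preset file
    for grid in service_grids:
        grid["lb_config"] = dict(new_config)       # push a copy to each grid
    return len(service_grids)                      # number of grids updated
```

Writing the file first and then fanning out to the grids keeps the preset configuration file as the single source of truth, so a grid created later can still read the latest configuration from it.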
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment.
The embodiment of the application also provides a service matching device. Referring to fig. 6, fig. 6 is a block diagram of a service matching device according to an embodiment of the present application, where the service matching device may include:
A receiving module 601, configured to receive an inference request sent to a model inference service;
the load balancing module 602 is configured to determine a request length of an inference request, and match a target container group for the inference request according to the request length and a hardware resource amount corresponding to a container group that provides a model inference service;
A request issuing module 603, configured to issue an inference request to the target container group, so that the target container group provides a model inference service according to the inference request.
Optionally, the load balancing module 602 may include:
the judging submodule is used for judging whether the request length is larger than the preset length or not;
The consumption determining submodule is used for determining the predicted resource consumption corresponding to the request length if the request length is greater than the preset length;
And the first matching submodule is used for matching the target container group in the container group meeting the predicted resource consumption according to the hardware resource remaining amount corresponding to the container group.
Optionally, the first matching sub-module may include:
a candidate group setting unit configured to set a container group satisfying the predicted resource consumption amount as a candidate container group;
The weight setting unit is used for setting selection weights for the candidate container groups according to the hardware resource residual amounts of the candidate container groups;
And the matching setting unit is used for matching the target container group for the reasoning request in the candidate container groups according to the selection weight.
Optionally, the load balancing module 602 may further include:
The inquiry sub-module is used for searching historical reasoning requests belonging to the same dialogue with the reasoning requests according to the dialogue identification information in the reasoning requests;
The consumption determination submodule may be used to:
And determining the predicted resource consumption according to the request length of the reasoning request and the request length of the historical reasoning request.
Optionally, the load balancing module 602 may further include:
And a second matching sub-module, configured to, if the request length is not greater than the preset length, take the container group corresponding to the client as the target container group according to the client to which the inference request belongs, or take the container group with the smallest number of requests in process as the target container group according to the number of requests each container group is currently processing.
Optionally, the load balancing module 602 may include:
The request length determining submodule is used for reading the request length from the content length field of the reasoning request, or determining the message length of a request message containing the reasoning request, and taking the message length as the request length.
Optionally, the model reasoning service includes a service component including an external service component, an internal service component, and a request buffer;
the receiving module 601 may include:
The first control sub-module is used for controlling the external service component to receive the reasoning request;
The performance detection sub-module is used for judging whether the container group is in an overall slow running state according to the overall performance value corresponding to the container group;
The second control sub-module is used for controlling the external service component to send the reasoning request to the request buffer and controlling the internal service component to acquire the reasoning request from the request buffer if the container group is in the overall slow running state;
the request issuing module 603 may be configured to:
The inference request in the external service component or the inference request in the internal service component is issued to the target container group.
Optionally, the model reasoning service further comprises a service grid;
the load balancing module 602 may include:
The second control sub-module is used for controlling the service grid to execute the step of determining the request length of the reasoning request in the external service component and matching the target container group for the reasoning request according to the request length and the hardware resource quantity corresponding to the container group for providing the model reasoning service if the container group is not in the overall slow running state;
and the third control sub-module is used for controlling the service grid to execute the steps of determining the request length of the reasoning request in the internal service component and matching the target container group for the reasoning request according to the request length and the hardware resource quantity corresponding to the container group for providing the model reasoning service if the container group is in the overall slow running state.
Optionally, the second control submodule may be configured to add the domain name of the external service component to the load balancing configuration of the service grid if the container group is not in the overall slow running state, and control the service grid to execute the steps of determining the request length of the reasoning request in the external service component according to the load balancing configuration, and matching the target container group for the reasoning request according to the request length and the hardware resource amount corresponding to the container group providing the model reasoning service;
And the third control sub-module may be configured to, if the container group is in the overall slow running state, add the domain name of the internal service component into the load balancing configuration of the service grid, and control the service grid to execute, in the internal service component according to the load balancing configuration, the steps of determining the request length of the inference request and matching the target container group for the inference request according to the request length and the hardware resource amount corresponding to the container groups providing the model inference service.
Optionally, the apparatus may further include:
The creation request receiving module is used for receiving the model reasoning service creation request and determining the custom resources required by the model reasoning service according to the model reasoning service creation request;
The container group creation module is used for controlling the resource controller to analyze the self-defined resources into container original resources so as to create a container group, an external service component and an internal service component corresponding to the model reasoning service and mark version information for the container group;
And the service grid creation module is used for controlling the resource controller to create a service grid for the model reasoning service and adding load balancing configuration for the service grid.
Optionally, the apparatus may further include:
The update request receiving module is used for receiving the update request of the model reasoning service and determining the latest custom resources required by the model reasoning service according to the update request of the model reasoning service;
The container group updating module is used for controlling the resource controller to convert the latest custom resource into container native resources, obtaining the new-version container group, external service component, and internal service component of the model inference service, and marking new version information for the new-version container group;
The service grid updating module is used for controlling the resource controller to create a new version of service grid for the model reasoning service and adding load balancing configuration for the new version of service grid;
And the deleting module is used for searching and deleting the container group, the external service component and the internal service component of the old version of the model reasoning service.
Optionally, the service grid creation module may include:
the load balancing configuration reading sub-module is used for controlling the resource controller to read the global load balancing configuration from the preset configuration file and adding the global load balancing configuration to the service grid.
Optionally, the apparatus may further include:
The load balancing configuration updating module is used for controlling the policy controller to receive a load balancing configuration updating request, updating a preset configuration file by using the load balancing configuration updating request, and updating the load balancing configuration in the service grid of each model reasoning service by using the preset configuration file.
The description of the features in the embodiment corresponding to the service matching device may refer to the related description of the embodiment corresponding to the service matching method, which is not described in detail herein.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the service matching method embodiments described above.
Referring to fig. 7, fig. 7 is a block diagram of an electronic device according to an embodiment of the present application, and the embodiment of the present application provides an electronic device 10, including a processor 11 and a memory 12, where the memory 12 is used for storing a computer program, and the processor 11 is used for executing the service matching method provided in the foregoing embodiment when executing the computer program.
For the specific process of the service matching method, reference may be made to the corresponding content provided in the foregoing embodiment, and no further description is given here.
The memory 12 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the storage may be a temporary storage or a permanent storage.
In addition, the electronic device 10 further includes a power supply 13, a communication interface 14, an input/output interface 15, and a communication bus 16. The power supply 13 is configured to provide a working voltage for each hardware device on the electronic device 10. The communication interface 14 is capable of creating a data transmission channel between the electronic device 10 and an external device; the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application and is not specifically limited here. The input/output interface 15 is configured to obtain external input data or to output data to an external device; its specific interface type may be selected according to the needs of the specific application and is likewise not specifically limited here.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the service matching method embodiments described above when run.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing a computer program.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the service matching method embodiments described above.
Embodiments of the present application also provide another computer program product comprising a non-volatile computer readable storage medium storing a computer program which when executed by a processor implements the steps of any of the service matching method embodiments described above.
Those skilled in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative elements and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The service matching method, the apparatus, the electronic device, the program product and the storage medium provided by the present application have been described in detail above. The principles and embodiments of the present application are explained herein with reference to specific examples, and the description of these examples is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that the present application may be modified and practiced in various ways without departing from the spirit of the present application.
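To make the matching logic described above concrete, the following is a minimal Python sketch (not part of the patent text). The preset length, the linear resource predictor, and all names are illustrative assumptions; the patent leaves these to the concrete deployment.

```python
import random

# Assumed threshold (bytes); the patent leaves the preset length to configuration.
PRESET_LENGTH = 4096


def predict_consumption(request_length: int) -> int:
    """Toy linear predictor (assumption): longer requests consume more resources."""
    return 2 * request_length


def match_target_group(request_length: int, remaining: dict) -> str:
    """Pick a target container group from {group_name: remaining_resources}.

    Long requests (claims 2-3): keep only the groups whose remaining hardware
    resources cover the predicted consumption, then choose one at random with
    probability proportional to its remaining resources. Short requests
    (claim 4, simplified): fall back to the group with the most remaining
    resources as a stand-in for the least-loaded group."""
    if request_length > PRESET_LENGTH:
        need = predict_consumption(request_length)
        candidates = {g: r for g, r in remaining.items() if r >= need}
        if not candidates:
            raise RuntimeError("no container group satisfies the predicted consumption")
        groups = list(candidates)
        weights = [candidates[g] for g in groups]  # selection weight ∝ remaining resources
        return random.choices(groups, weights=weights, k=1)[0]
    # Short request: simplified fallback choice.
    return max(remaining, key=remaining.get)
```

The key design point, as the abstract notes, is that request length is treated as a first-class load signal, so a long prompt is never routed to a container group that lacks the headroom to serve it.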

Claims (15)

1. A service matching method, comprising: receiving an inference request sent to a model inference service; determining a request length of the inference request, and matching a target container group for the inference request according to the request length and an amount of hardware resources corresponding to container groups providing the model inference service, wherein the model inference service is provided by at least two container groups; and sending the inference request to the target container group, so that the target container group provides the model inference service according to the inference request.

2. The service matching method according to claim 1, wherein matching a target container group for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service comprises: determining whether the request length is greater than a preset length; if the request length is greater than the preset length, determining a predicted resource consumption corresponding to the request length; and matching the target container group among the container groups satisfying the predicted resource consumption, according to a remaining amount of hardware resources corresponding to the container groups.

3. The service matching method according to claim 2, wherein matching the target container group among the container groups satisfying the predicted resource consumption comprises: taking the container groups satisfying the predicted resource consumption as candidate container groups; setting a selection weight for each candidate container group according to the remaining amount of hardware resources of the candidate container group, the selection weight being positively correlated with the remaining amount of hardware resources; and matching a target container group for the inference request among the candidate container groups according to the selection weights.

4. The service matching method according to claim 2, further comprising, after determining whether the request length is greater than the preset length: if the request length is not greater than the preset length, taking, according to a client to which the inference request belongs, the container group corresponding to the client as the target container group, or taking, according to a current number of requests being processed by each container group, the container group with the smallest number of requests being processed as the target container group.

5. The service matching method according to claim 1, wherein determining the request length of the inference request comprises: reading the request length from a content-length field of the inference request; or determining a message length of a request message containing the inference request, and taking the message length as the request length.

6. The service matching method according to any one of claims 1 to 5, wherein the model inference service comprises service components, the service components comprising an external service component, an internal service component and a request buffer; receiving an inference request sent to a model inference service comprises: controlling the external service component to receive the inference request; determining, according to an overall performance value corresponding to the container groups, whether the container groups are in an overall slow-running state; and if the container groups are in the overall slow-running state, controlling the external service component to send the inference request to the request buffer, and controlling the internal service component to obtain the inference request from the request buffer; and sending the inference request to the target container group comprises: sending the inference request in the external service component or the inference request in the internal service component to the target container group.

7. The service matching method according to claim 6, wherein the model inference service further comprises a service grid; and determining the request length of the inference request, and matching a target container group for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service, comprises: if the container groups are not in the overall slow-running state, controlling the service grid to perform, in the external service component, the step of determining the request length of the inference request and matching a target container group for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service; and if the container groups are in the overall slow-running state, controlling the service grid to perform, in the internal service component, the step of determining the request length of the inference request and matching a target container group for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service.

8. The service matching method according to claim 7, further comprising: if the container groups are not in the overall slow-running state, adding a domain name of the external service component to a load balancing configuration of the service grid, and controlling the service grid to perform, according to the load balancing configuration and in the external service component, the step of determining the request length of the inference request and matching a target container group for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service; and if the container groups are in the overall slow-running state, adding a domain name of the internal service component to the load balancing configuration of the service grid, and controlling the service grid to perform, according to the load balancing configuration and in the external service component, the step of determining the request length of the inference request and matching a target container group for the inference request according to the request length and the amount of hardware resources corresponding to the container groups providing the model inference service.

9. The service matching method according to claim 8, further comprising, before receiving the inference request sent to the model inference service: receiving a model inference service creation request, and determining, according to the model inference service creation request, custom resources required by the model inference service; controlling a resource controller to parse the custom resources into container-native resources, so as to create the container groups, the external service component and the internal service component corresponding to the model inference service, and to mark version information for the container groups; and controlling the resource controller to create a service grid for the model inference service, and to add a load balancing configuration for the service grid.

10. The service matching method according to claim 9, further comprising: receiving a model inference service update request, and determining, according to the model inference service update request, the latest custom resources required by the model inference service; controlling the resource controller to convert the latest custom resources into container-native resources, so as to obtain a new version of the container groups, the external service component and the internal service component of the model inference service, and to mark new version information for the new version of the container groups; controlling the resource controller to create a new version of the service grid for the model inference service, and to add a load balancing configuration for the new version of the service grid; and finding and deleting the old version of the container groups, the external service component and the internal service component of the model inference service.

11. The service matching method according to claim 9, wherein adding a load balancing configuration for the service grid comprises: controlling the resource controller to read a global load balancing configuration from a preset configuration file, and to add the global load balancing configuration to the service grid.

12. The service matching method according to claim 11, further comprising: controlling a policy controller to receive a load balancing configuration update request, to update the preset configuration file using the load balancing configuration update request, and to update, using the preset configuration file, the load balancing configuration in the service grid of each model inference service.

13. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing the service matching method according to any one of claims 1 to 12 when executing the computer program.

14. A computer program product, comprising a computer program or instructions, wherein the computer program or instructions, when executed by a processor, implement the service matching method according to any one of claims 1 to 12.

15. A non-volatile computer-readable storage medium, wherein the non-volatile computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are loaded and executed by a processor, the service matching method according to any one of claims 1 to 12 is implemented.
CN202510408497.9A 2025-04-02 2025-04-02 A service matching method, electronic device, program product and storage medium Pending CN119917293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510408497.9A CN119917293A (en) 2025-04-02 2025-04-02 A service matching method, electronic device, program product and storage medium


Publications (1)

Publication Number Publication Date
CN119917293A true CN119917293A (en) 2025-05-02

Family

ID=95509585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510408497.9A Pending CN119917293A (en) 2025-04-02 2025-04-02 A service matching method, electronic device, program product and storage medium

Country Status (1)

Country Link
CN (1) CN119917293A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102904942A (en) * 2012-09-28 2013-01-30 用友软件股份有限公司 Service resource control system and service resource control method
CN106649471A (en) * 2016-09-28 2017-05-10 新华三技术有限公司 Access control method and apparatus
CN116755903A (en) * 2023-06-20 2023-09-15 亿嘉和科技股份有限公司 Distributed high-concurrency power algorithm analysis system and method
CN119537040A (en) * 2025-01-22 2025-02-28 深圳华为云计算技术有限公司 A task processing method, device and equipment running on a cloud computing platform
CN119597394A (en) * 2024-10-12 2025-03-11 广州趣研网络科技有限公司 Model virtualization deployment method and device, storage medium and computer equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination