CN119415273A - Reasoning service management method, device, medium and computer program product - Google Patents
Reasoning service management method, device, medium and computer program product
- Publication number
- CN119415273A (application number CN202510025516.XA)
- Authority
- CN
- China
- Prior art keywords
- service
- language model
- inference service
- inference
- reasoning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Multi Processors (AREA)
Abstract
The invention relates to the field of computer technology and discloses an inference service management method, device, medium, and computer program product. The method comprises: receiving a language model dialogue and forwarding it to the corresponding inference service replica using a load balancer; counting throughput data of the inference service replicas over a set period; and, when the counted throughput data exceeds a set throughput threshold, sending a scale-out notification signal through the inference service scaling component to the inference service resource scheduler, which expands the inference service to obtain scaled-out replicas whose addresses are then configured in the load balancer. In this way, throughput of the language model service is used as the basis for both service scale-out and load-balancing selection, and expansion of the language model inference service is triggered when throughput exceeds a threshold, so that resources are allocated and used optimally, resource utilization is improved, and user experience is enhanced.
Description
Technical Field
The present invention relates to the field of computer technology, and in particular, to a method, apparatus, medium, and computer program product for managing inference services.
Background
The inference service of a language model requires substantial graphics processing unit (GPU) resources, and a single host can typically deploy only one such service, so the dialogue requests of a large number of users cannot be satisfied in high-concurrency Internet scenarios. Meanwhile, a data center generally accumulates GPU servers of different specifications, and language model services need to run on these heterogeneous GPU servers following the principle of asset reuse. In related technical schemes, multiple replicas of the same application share the same processor and memory specification and carry equal weight during elastic scaling; load-balancing strategies such as round-robin and least-connections forward requests on the assumption that different replicas provide the same load capacity, which does not hold in the language model service scenario. Moreover, elastic scaling strategies are mostly based on simple resource-utilization metrics and do not consider the performance indicators specific to language model inference services, so they cannot properly scale such services.
Disclosure of Invention
The object of the present invention is to provide an inference service management method, device, medium, and computer program product that suit the language model service scenario and enable scaling of the language model inference service, so as to allocate and use resources optimally, improve resource utilization, and enhance user experience.
In order to solve the above technical problems, the present invention provides an inference service management method for an inference service platform comprising a load balancer, an inference service scaling component, and an inference service resource scheduler, the method comprising the following steps:
receiving a language model dialogue and forwarding it to the corresponding inference service replica using the load balancer;
while the language model inference service is running, counting throughput data of the inference service replicas over a set period using the inference service scaling component;
when the counted throughput data exceeds a set throughput threshold, sending a scale-out notification signal to the inference service resource scheduler through the inference service scaling component, so that the scheduler performs inference service scale-out and obtains a scaled-out replica;
and configuring the address of the scaled-out replica using the load balancer.
In a first aspect, in the above inference service management method provided by the present invention, receiving a language model dialogue and forwarding it to the corresponding inference service replica using the load balancer comprises:
receiving a first language model dialogue carrying a target dialogue identifier;
determining the corresponding inference service replica according to the target dialogue identifier;
forwarding the first language model dialogue to the determined inference service replica using the load balancer.
In another aspect, in the above inference service management method provided by the present invention, before receiving the first language model dialogue carrying the target dialogue identifier, the method further comprises:
establishing an association between dialogue identifiers and inference service replicas.
Determining the corresponding inference service replica according to the target dialogue identifier then comprises:
determining the inference service replica associated with the target dialogue identifier according to the established association between dialogue identifiers and inference service replicas.
In another aspect, in the above inference service management method provided by the present invention, establishing an association between a dialogue identifier and an inference service replica comprises:
when a language model dialogue is created for the first time, randomly generating a corresponding dialogue identifier;
querying real-time throughput data of the inference service replicas using the load balancer and selecting the replica with the minimum throughput;
recording the relationship between the randomly generated dialogue identifier and the address of the selected replica, so as to establish the association between the dialogue identifier and the inference service replica.
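The three steps above can be sketched in code. This is a minimal illustration under assumed names (`Replica`, `LoadBalancer`, `bind_new_dialogue`), not the patent's actual implementation:

```python
import uuid

class Replica:
    """A hypothetical inference service replica with a live throughput reading."""
    def __init__(self, address, throughput):
        self.address = address
        self.throughput = throughput  # tokens/s currently being generated

class LoadBalancer:
    def __init__(self, replicas):
        self.replicas = replicas
        self.bindings = {}  # dialogue identifier -> replica address

    def bind_new_dialogue(self):
        # Step 1: randomly generate a dialogue identifier on first creation.
        dialogue_id = uuid.uuid4().hex
        # Step 2: query real-time throughput and pick the least-loaded replica.
        replica = min(self.replicas, key=lambda r: r.throughput)
        # Step 3: record identifier -> replica address to establish the association.
        self.bindings[dialogue_id] = replica.address
        return dialogue_id, replica.address
```

Subsequent dialogues carrying the same identifier are then routed to the bound replica, which is what allows the key-value cache of the earlier rounds to be reused.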
In another aspect, the inference service management method provided by the present invention further comprises:
if the throughput of the determined inference service replica has reached its maximum, or the replica no longer exists, forwarding the first language model dialogue to another inference service replica using the load balancer and updating the association between the dialogue identifier and the inference service replica.
In another aspect, in the above inference service management method provided by the present invention, receiving a language model dialogue and forwarding it to the corresponding inference service replica using the load balancer comprises:
receiving a second language model dialogue that does not carry a dialogue identifier;
selecting the inference service replica with the smallest ratio of real-time throughput to maximum throughput using the load balancer;
forwarding the second language model dialogue to the selected inference service replica.
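The least-ratio selection described above can be sketched as follows; the function name `select_replica` and the tuple layout are illustrative assumptions:

```python
def select_replica(replicas):
    """Pick the replica whose real-time throughput is smallest relative to its
    maximum throughput, i.e. the replica with the most spare capacity.

    replicas: list of (address, realtime_tps, max_tps) tuples.
    """
    return min(replicas, key=lambda r: r[1] / r[2])[0]
```

Note that this differs from a plain minimum-throughput rule: a replica on a powerful node may show higher absolute throughput yet still have the most headroom relative to its own maximum.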
In another aspect, the inference service management method provided by the present invention further comprises:
when the counted throughput data is zero, sending a scale-in notification signal to the inference service resource scheduler through the inference service scaling component, so that the scheduler performs inference service scale-in and obtains a scaled-in replica;
removing the address of the scaled-in replica from the load balancer.
In another aspect, in the above inference service management method provided by the present invention, before receiving the language model dialogue, the method further comprises:
when nodes of multiple specifications exist in the cluster, performing throughput performance tests on the language model service and recording its maximum throughput on nodes of each specification;
defining different set throughput thresholds for different inference service replicas of the same language model service according to the measured maximum throughput data.
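A sketch of deriving per-specification thresholds from the benchmark data; the `headroom` factor is an illustrative assumption and is not specified by the patent:

```python
def define_thresholds(benchmarks, headroom=0.8):
    """Given benchmarked maximum throughput (tokens/s) per node specification,
    derive a per-replica scale-out threshold as a fraction of that maximum.

    benchmarks: dict mapping node specification name -> max tokens/s measured.
    headroom:   fraction of the maximum at which to trigger scale-out
                (an assumption for illustration).
    """
    return {spec: max_tps * headroom for spec, max_tps in benchmarks.items()}
```

A replica on an A100-class node and one on a V100-class node would thus get different thresholds, which is what lets replicas of the same service provide differentiated load capacities.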
In another aspect, in the above inference service management method provided by the present invention, before receiving the language model dialogue, the method further comprises:
selecting, with the inference service resource scheduler and according to cluster resource usage, a resource specification satisfying a set condition to deploy the language model inference service.
In another aspect, in the above inference service management method provided by the present invention, selecting a resource specification satisfying a set condition to deploy the language model inference service according to cluster resource usage comprises:
screening, with the inference service resource scheduler, the nodes in the cluster that meet the configured processor and memory requirements;
obtaining a resource specification list by inference service name and selecting from it the resource specification providing the maximum throughput;
judging whether the screened nodes satisfy the selected resource specification;
if they do, selecting an idle node among them to deploy the language model inference service;
if they do not, selecting a node according to the throughput ranking to deploy the language model inference service.
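The branching logic above can be sketched as follows. The node and specification fields (`cpu`, `mem`, `spec`, `idle`, `throughput`) are hypothetical names used only for illustration:

```python
def schedule(nodes, cpu_req, mem_req, spec_list):
    """Illustrative sketch of the scheduler's decision path described above.

    nodes:     dicts with 'name', 'cpu', 'mem', 'spec', 'idle', 'throughput'.
    spec_list: dicts with 'name' and 'throughput', one per resource spec.
    """
    # Screen nodes in the cluster that meet the configured processor
    # and memory parameters.
    eligible = [n for n in nodes if n["cpu"] >= cpu_req and n["mem"] >= mem_req]
    # Select the resource specification providing the maximum throughput.
    best_spec = max(spec_list, key=lambda s: s["throughput"])
    # Judge whether any screened node satisfies the selected specification;
    # if so, prefer an idle one among them.
    matching = [n for n in eligible if n["spec"] == best_spec["name"]]
    idle = [n for n in matching if n["idle"]]
    if idle:
        return idle[0]
    # Otherwise fall back to the eligible node ranked highest by throughput.
    return max(eligible, key=lambda n: n["throughput"])
```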
In another aspect, in the above inference service management method provided by the present invention, after the language model inference service is deployed, the method further comprises:
recording the address of the inference service replica and configuring it in the load balancer.
Forwarding the language model dialogue to the corresponding inference service replica using the load balancer then comprises:
forwarding the language model dialogue to the corresponding inference service replica according to the configured replica address.
In another aspect, in the above inference service management method provided by the present invention, counting throughput data of the inference service replicas over a set period using the inference service scaling component comprises:
calculating, with the scaling component, the sum of the throughput of every language model dialogue of every user on an inference service replica within the set duration;
dividing this throughput sum by the set duration to obtain the average throughput of the dialogues over the period.
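A minimal sketch of this statistic; interpreting the inputs as per-dialogue token counts within the window is an assumption, since the claim does not fix the unit:

```python
def average_throughput(dialogue_token_counts, period_s):
    """Sum the output of every dialogue of every user on one replica over the
    set period, then divide by the period length, as the claim describes.

    dialogue_token_counts: tokens generated per dialogue during the window
                           (an assumed unit for illustration).
    period_s:              the set duration in seconds.
    Returns the replica's average throughput in tokens per second.
    """
    return sum(dialogue_token_counts) / period_s
```

This per-replica figure is what the scaling component then compares against the replica's set throughput threshold.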
To solve the above technical problem, the present invention also provides an inference service management device, comprising:
a memory for storing a computer program;
and a processor for implementing the steps of the above inference service management method when executing the computer program.
To solve the above technical problem, the present invention further provides a nonvolatile storage medium on which a computer program is stored, the computer program implementing the steps of the above inference service management method when executed by a processor.
To solve the above technical problem, the present invention also provides a computer program product comprising a computer program/instructions which, when executed by a processor, implements the steps of the above inference service management method.
The method thus comprises: receiving a language model dialogue and forwarding it to the corresponding inference service replica using the load balancer; counting throughput data of the inference service replicas over a set period using the inference service scaling component while the language model inference service is running; when the counted throughput data exceeds a set throughput threshold, sending a scale-out notification signal to the inference service resource scheduler through the scaling component, so that the scheduler performs inference service scale-out and obtains a scaled-out replica; and configuring the address of the scaled-out replica using the load balancer.
The beneficial effects of the present invention are as follows. The inference service management method uses the load balancer to forward language model dialogues to the corresponding inference service replicas and uses the scaling component to count replica throughput over a set period; when the count exceeds the set throughput threshold, a scale-out notification signal is sent through the scaling component to the inference service resource scheduler, which expands the service to obtain a scaled-out replica whose address is then configured in the load balancer. The method suits the language model service scenario and optimizes both the scaling mechanism and the load-balancing configuration in that scenario: throughput of the language model service serves as the basis for scale-out decisions and load-balancing selection, and expansion of the inference service is triggered when throughput exceeds a threshold, so that resources are allocated and used optimally, resource utilization is improved, and user experience is enhanced.
In addition, the present invention also provides a corresponding inference service management device, nonvolatile storage medium, and computer program product for the above method, which have the same or corresponding technical features and effects.
Drawings
For a clearer description of the embodiments of the present invention, the drawings required by the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present invention; other drawings may be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of an inference service management method provided in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the framework corresponding to the inference service management method provided in an embodiment of the present invention;
Fig. 3 is a signaling interaction diagram corresponding to the inference service management method provided in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an inference service management device provided in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an inference service management apparatus provided in an embodiment of the present invention.
Detailed Description
With the development of artificial intelligence technology, language models play an increasingly important role in natural language processing and can handle complex language tasks such as text generation, translation, and dialogue. The inference service of a language model usually requires substantial graphics processing unit (GPU) resources, and a single host can typically deploy only one language model service, which is insufficient to satisfy the dialogue requests of a large number of users in high-concurrency Internet scenarios. Meanwhile, a data center usually accumulates GPU servers of different specifications, and language model services need to run on these heterogeneous servers following the principle of asset reuse. How to effectively manage the resources required by the inference service, scale the language model inference service in real time, and define a load-balancing strategy suited to it therefore becomes key to improving resource utilization and user experience.
In related technical schemes, the central processing unit (CPU) and memory are generally used as the resource specification. Even when running on servers of different specifications, multiple replicas of the same application share the same CPU and memory specification, are assumed to provide the same maximum load capacity, and carry equal weight during elastic scaling and load-balancing policy management. For language model services, however, a replica's performance depends strongly on the GPU computing power and video memory of its node, that is, on the GPU model and count of the node running the service. When different replicas of the inference service run on nodes of different types, they generally have different throughput upper limits; in other words, the hardware GPU resources determine the maximum load capacity of the inference service. Related load-balancing policies such as round-robin and least-connections forward requests on the assumption that different replicas provide the same load capacity, and are therefore not applicable to the language model service scenario.
In addition, with the development of containerization technology, service replicas are automatically added and removed by an elastic scaling mechanism, i.e., the number of service instances is dynamically adjusted according to the actual load to adapt to changing business demands. However, most elastic policies are based on simple resource-usage metrics such as CPU and memory utilization or queries per second (QPS). These generally ignore the performance indicators specific to language model inference services, such as throughput (tokens/s) and the different load capacities of the service on nodes of different specifications; plain CPU, memory, and QPS figures cannot accurately represent the state of the inference service.
Furthermore, language model services typically use key-value caching (KV cache) technology to reuse the key-value pair information of tokens that have already been generated, thereby increasing inference speed. This requires that multiple dialogues on the same topic use the same replica of the language model service so as to maximize KV-cache reuse. Related load-balancing policies provide a certain session-keeping capability, such as policies based on source IP or session persistence, but in a language model service scenario one client typically holds dialogues on multiple topics; source IP and session cannot distinguish different topic dialogues and also suffer from session expiration, so they are not applicable to the language model service scenario.
To solve the above problems, the present invention provides an inference service management method that can be applied to the language model inference service scenario and optimizes the configuration of the scaling strategy and the load-balancing strategy in that scenario.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present invention.
In order to better understand the aspects of the present invention, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. Fig. 1 is a flowchart of an inference service management method provided in an embodiment of the present invention. As shown in Fig. 1, the method is used for an inference service platform comprising a load balancer, an inference service scaling component, and an inference service resource scheduler, and comprises:
S101, receiving a language model dialogue and forwarding it to the corresponding inference service replica using the load balancer.
A language model is a model that learns from a large volume of text to grasp language features such as the structure of the language, the usage of words, and the composition rules of sentences. In the present invention, the language model may be a large language model (LLM) or another type of language model, which is not limited here. The language model inference service is an online inference service deployed from a language model and supports multi-replica deployment. It can be provided to Internet users in the form of multi-round dialogues: each round of dialogue content is generated by the service as a long sequence, i.e., a stream of tokens, and the output is presented to the end user character by character or word by word.
Fig. 2 is a schematic diagram of the architecture corresponding to the inference service management method provided in an embodiment of the present invention. In implementation, as shown in Fig. 2, user dialogue requests may be passed through a proxy service (e.g., an Nginx proxy) to a web service, which passes them to the load balancer. A load balancer is a device for distributing network traffic or workload among computing resources. In step S101, the load balancer receives the language model dialogue and forwards it to the corresponding inference service replica.
S102, while the language model inference service is running, counting throughput data of the inference service replicas over a set period using the inference service scaling component.
It should be noted that the inference service scaling component is a tool for dynamically adjusting inference service resources. In the present invention it counts the throughput data of the inference service replicas over a set period (i.e., the number of tokens generated per second), so as to drive the dynamic adjustment of computing resources. The set period may be chosen according to the actual situation and is not limited here. Throughput is a fine-grained performance indicator of the language model service, and the invention uses this indicator as the basis for the subsequent service scale-out.
In step S102, while the language model inference service is running, the inference service framework can be used to count the throughput data of the multiple inference service replicas over the set period.
S103, when the counted throughput data exceeds the set throughput threshold, sending a scale-out notification signal to the inference service resource scheduler through the inference service scaling component, so that the scheduler performs inference service scale-out and obtains a scaled-out replica.
It should be noted that the inference service resource scheduler manages and allocates the resources required by the inference service. It performs the scale-out of the inference service to obtain scaled-out replicas for handling further inference tasks.
The set throughput threshold may be defined in advance, before step S103 is executed. Different set throughput thresholds are defined for different replicas of the same language model inference service, since different replicas provide differentiated load capacities.
In step S103, it is judged whether the throughput data counted by the scaling component exceeds the predefined corresponding threshold. If so, the processing rate of the inference service replica has reached its limit, and a scale-out notification signal is sent through the scaling component to the inference service resource scheduler, so that the scheduler performs inference service scale-out and obtains a scaled-out replica.
S104, configuring the address of the scaled-out replica using the load balancer.
As shown in Fig. 2, after the inference service resource scheduler obtains the scaled-out replica, the load balancer is used to configure the replica's address; the address may belong to nodes of different specifications, e.g., a first node, a second node, a third node, a fourth node, and so on.
In the inference service management method provided by this embodiment of the present invention, the load balancer forwards language model dialogues to the corresponding inference service replicas; the scaling component counts replica throughput over a set period; and when the count exceeds the set throughput threshold, a scale-out notification signal is sent through the scaling component to the inference service resource scheduler, which performs the scale-out to obtain a scaled-out replica whose address is then configured in the load balancer. This optimizes the scaling mechanism and load-balancing configuration in the language model service scenario: throughput serves as the basis for scale-out decisions and load-balancing selection, and expansion of the inference service is triggered when throughput exceeds a threshold, so that resources are allocated and used optimally, resource utilization is improved, and user experience is enhanced.
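The S102-S103 decision, together with the zero-throughput scale-in case stated in the claims, can be condensed into a single evaluation function. This is an illustrative sketch, not the patent's implementation:

```python
def scaling_decision(counted_tps, threshold):
    """One evaluation of the scaling component: return 'scale_out' when the
    counted throughput exceeds the replica's threshold, 'scale_in' when it is
    zero, and 'hold' otherwise, following the claim wording."""
    if counted_tps > threshold:
        return "scale_out"   # notify the resource scheduler to add a replica
    if counted_tps == 0:
        return "scale_in"    # notify the scheduler to remove a replica
    return "hold"
```

The component would run this check once per set period for each replica, using that replica's own threshold from the per-specification benchmark data.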
Further, in the above inference service management method provided by the embodiment of the present invention, step S101 (receiving a language model dialogue and forwarding it to the corresponding inference service replica using the load balancer) specifically comprises: receiving a first language model dialogue carrying a target dialogue identifier; determining the corresponding inference service replica according to the target dialogue identifier; and forwarding the first language model dialogue to the determined replica using the load balancer.
In implementation, the present invention introduces language model dialogue identifiers (IDs) into load-balancing policy selection, i.e., the dialogue identifier is taken into account by the service's elastic scaling and load-balancing policies. Different language model dialogues have different dialogue identifiers.
Fig. 3 is a signaling interaction diagram corresponding to the inference service management method provided in the embodiment of the present invention. As shown in Fig. 3, the corresponding inference service replica can be determined from the language model dialogue identifier, and the load balancer forwards the received dialogue to that replica, improving resource utilization and the running efficiency of the application.
Further, in the method for managing an inference service provided in the embodiment of the present invention, before receiving the first language model dialogue carrying the target dialogue identifier, the method may further include establishing an association relationship between the dialogue identifier and the inference service copy. That is, the language model dialogue identifier is bound with a certain reasoning service copy, and is used for multiplexing key value buffer information in the history dialogue to establish an association relationship between the dialogue identifier and the reasoning service copy, and the load balancer can save the association relationship between the dialogue identifier and the reasoning service copy.
Correspondingly, the step of determining the corresponding reasoning service copy according to the target dialogue identifier can specifically comprise the step of determining the reasoning service copy associated with the target dialogue identifier according to the established association relation between the dialogue identifier and the reasoning service copy and combining the target dialogue identifier.
In an implementation, after the first language model dialogue carrying the target dialogue identifier is received, the target dialogue identifier is parsed, and the established binding relationship between dialogue identifiers and reasoning service copies is used to determine the reasoning service copy bound to the target dialogue identifier.
Further, in the implementation, establishing the association relationship between the dialogue identifier and the reasoning service copy in the above steps may specifically include: when a language model dialogue is created for the first time, randomly generating a corresponding dialogue identifier; querying the real-time throughput data of the reasoning service copies by using the load balancer and selecting the reasoning service copy with the minimum throughput; and recording the relationship between the randomly generated dialogue identifier and the address of the selected reasoning service copy, so as to establish the association relationship between the dialogue identifier and the reasoning service copy.
In the implementation, when a corresponding language model dialogue is created for the first time, the association relationship between a dialogue identifier and an inference service copy can be established. Specifically, a corresponding dialogue identifier is first randomly generated; at this moment the load balancer holds no inference service copy information associated with that identifier. The load balancer then queries the real-time throughput data of each inference service copy and selects the copy with the minimum throughput. Finally, the load balancer records the association between the randomly generated dialogue identifier and the address of the selected inference service copy. When the dialogue is requested again, it can be preferentially forwarded to the inference service copy that already holds the historical dialogue information (so the cached context can be reused), thereby increasing reasoning speed.
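The first-time binding flow above can be sketched as follows; the class name, method names, and ID format are illustrative assumptions, not the invention's actual interface:

```python
import random
import string

class SessionAffinityBalancer:
    """Sketch of binding a randomly generated session ID to the least-loaded copy."""

    def __init__(self, replicas):
        # replicas: dict mapping replica address -> current real-time throughput (tokens/s)
        self.replica_throughput = dict(replicas)
        self.session_to_replica = {}  # session ID -> replica address

    def new_session(self):
        # Randomly generate a dialogue identifier for a first-time dialogue.
        session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=16))
        # Query throughput data and pick the copy with the minimum throughput.
        replica = min(self.replica_throughput, key=self.replica_throughput.get)
        # Record the association so later turns of this dialogue reuse the KV cache.
        self.session_to_replica[session_id] = replica
        return session_id, replica

    def route(self, session_id):
        # Subsequent requests carrying the ID return to the bound copy.
        return self.session_to_replica.get(session_id)
```

A request that arrives again with the same ID is simply routed through `route`, so the copy holding the historical dialogue serves it.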
Further, in a specific implementation, the method for managing the inference service provided by the embodiment of the invention may further include: if the throughput of the determined inference service copy has reached its maximum value, or the inference service copy no longer exists, forwarding the first language model dialogue to another inference service copy by using the load balancer and updating the association relationship between the dialogue identifier and the inference service copy.
In implementations, when the throughput of the bound inference service copy has reached its maximum, or the copy itself no longer exists, the load balancer forwards the first language model dialogue to another inference service copy and updates the association between the dialogue identifier and the inference service copy.
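A minimal sketch of this fallback-and-rebind step, assuming illustrative data shapes (`bindings` maps session IDs to replica addresses; `replicas` maps addresses to a (real-time, maximum) throughput pair):

```python
def route_with_fallback(session_id, bindings, replicas):
    """Route to the bound copy if it still has capacity; otherwise pick the
    least-loaded copy and update the session's binding."""
    addr = bindings.get(session_id)
    if addr in replicas and replicas[addr][0] < replicas[addr][1]:
        return addr  # bound copy exists and has not reached maximum throughput
    # Bound copy is saturated or gone: choose the copy with the smallest
    # real-time/maximum throughput ratio and rebind the session to it.
    new_addr = min(replicas, key=lambda a: replicas[a][0] / replicas[a][1])
    bindings[session_id] = new_addr
    return new_addr
```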
In addition, in the above-mentioned reasoning service management method provided by the embodiment of the present invention, step S101, receiving a language model dialogue and forwarding the language model dialogue to a corresponding reasoning service copy by using a load balancer, may specifically further include: receiving a second language model dialogue that does not carry a dialogue identifier; selecting, by using the load balancer, the reasoning service copy with the minimum ratio between real-time throughput and maximum throughput; and forwarding the second language model dialogue to the selected reasoning service copy.
In an implementation, as shown in fig. 3, after a second language model dialogue that does not carry a dialogue identifier is received, the load balancer can select the inference service copy with the minimum ratio between real-time throughput and maximum throughput and forward the second language model dialogue to that copy. That is, when multiple inference service copies of the language model service exist in the cluster and a received language model dialogue carries no dialogue identifier, the dialogue is preferentially forwarded to the copy with the most spare throughput capacity.
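The minimum-ratio selection can be sketched in one function, assuming each copy reports a (real-time, maximum) throughput pair:

```python
def pick_replica(replicas):
    """Select the copy with the smallest real-time/maximum throughput ratio.

    replicas: dict mapping address -> (real_time_throughput, max_throughput);
    the maximum differs per node specification, as described above.
    """
    return min(replicas, key=lambda addr: replicas[addr][0] / replicas[addr][1])
```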
The present invention can calculate the real-time throughput, and the ratio between the real-time throughput and the maximum throughput, using the following formulas:

T_rt = (Σ_i Σ_j T_ij) / t ;  r = T_rt / T_max

wherein T_ij denotes the throughput of the j-th session of user i, t is the set duration, T_rt denotes the real-time throughput, T_max denotes the maximum throughput of the language model on nodes of the given specification, and r denotes the ratio between the real-time throughput and the maximum throughput.
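The formulas above can be sketched in code; the nested-list input shape and function names are illustrative assumptions:

```python
def realtime_throughput(session_throughputs, t):
    # T_rt = (sum over users i and sessions j of T_ij) / t:
    # sum every user's per-session throughputs over the window, then
    # average over the window length t.
    return sum(sum(per_user) for per_user in session_throughputs) / t

def load_ratio(t_rt, t_max):
    # r = T_rt / T_max, the index used for forwarding and scaling decisions.
    return t_rt / t_max
```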
Further, in a specific implementation, the method for managing the inference service provided by the embodiment of the invention may further include: when the counted throughput data is zero, sending a capacity reduction notification signal to the inference service resource scheduler through the inference service scaling component, so that the inference service resource scheduler performs inference service capacity reduction to obtain a reduced copy; and removing the address of the reduced copy by using the load balancer.
In implementation, as shown in fig. 3, when the throughput data counted over the set period is zero, the capacity reduction notification signal can be sent to the reasoning service resource scheduler through the reasoning service scaling component, so that the reasoning service resource scheduler performs reasoning service capacity reduction to obtain a reduced copy, thereby ensuring efficient operation of the reasoning service under different load conditions.
It should be noted that a general capacity reduction mechanism performs the reduction operation when traffic falls below a certain threshold. However, a language model uses a local key-value cache to optimize reasoning performance at run time, and the same dialogue achieves the best performance only when served by the same reasoning service copy. Therefore, capacity reduction of the language model reasoning service should be triggered only when no traffic is generated. The present invention can set the period to 30 minutes: when no request is received within 30 minutes, a reasoning service copy is reduced, but the reasoning service must retain at least one copy.
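The idle-window rule above (a 30-minute period, with at least one copy always retained) can be sketched as:

```python
IDLE_WINDOW_S = 30 * 60  # the 30-minute period chosen above
MIN_REPLICAS = 1         # the reasoning service must keep at least one copy

def replicas_to_remove(last_request_ts, now, replica_count):
    """Return how many copies the scaler may remove: reduce capacity only when
    no request arrived within the idle window, and never below MIN_REPLICAS."""
    if now - last_request_ts < IDLE_WINDOW_S:
        return 0  # traffic seen recently; do not scale in
    return max(replica_count - MIN_REPLICAS, 0)
```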
Further, in a specific implementation, before receiving the language model dialogue, the reasoning service management method provided by the embodiment of the invention may further include: when nodes of multiple specifications exist in the cluster, performing a throughput performance test on the language model service; counting the maximum throughput performance data of the language model service on nodes of different specifications; and defining, according to the counted maximum throughput performance data, different set throughput thresholds for different reasoning service copies of the same language model service.
In implementation, when GPU nodes of multiple specifications exist in the artificial intelligence cluster, the invention can test the throughput performance of the language model service to be deployed and count the maximum throughput performance data of the language model service on GPU nodes of different specifications; different set throughput thresholds are defined for different reasoning service copies of the same language model service, so that GPU resources are optimally allocated and used.
It should be noted that, in the language model service scenario, based on the data center asset multiplexing principle, different service copies of the same language model service are allowed to run on GPU nodes of different specifications, and the maximum throughput of the language model service on GPU nodes of different specifications is used as the elastic scaling and load balancing index; that is, the scaling index and the request forwarding index of the same service differ on nodes of different specifications.
Take testing the throughput of the language model service to be deployed on different numbers of acceleration cards of different specifications as an example, where each specification meets the minimum resource requirement of the reasoning task: the throughput on 8 A100 cards, the throughput on 8 H100 cards, and the throughput on 4 A100 cards are each measured separately.
The obtained performance test data are configured on the inference service platform, and the resource specifications available for running the language model inference service are defined, for example the three tested specifications: 8 H100, 8 A100, and 4 A100. The inference service resource scheduler can then select an appropriate resource specification to deploy the language model inference service according to the cluster resource usage. That is, the invention can deploy the language model reasoning service by selecting, through the reasoning service resource scheduler and according to the cluster resource usage, a resource specification meeting the set conditions, thereby ensuring the stability of reasoning operation.
The present invention can use the data in table 1 for resource scheduling and access policy management.
Table 1 performance test data
Selecting, by the inference service resource scheduler, a resource specification meeting the set conditions according to the cluster resource usage and deploying the language model inference service may include: screening, by the inference service resource scheduler and according to the configured processor and memory parameters, the nodes in the cluster that meet the processor and memory requirements; obtaining a resource specification list according to the inference service name and selecting from it the resource specification providing the maximum throughput; judging whether the screened nodes satisfy the selected resource specification; if so, selecting an idle node among the screened nodes to deploy the language model inference service; and if not, selecting a corresponding node according to the throughput ordering to deploy the language model inference service.
It should be noted that when the language model service is deployed on the reasoning service platform, the CPU and memory parameters must be set, but the number and model of GPU cards are not configured; the reasoning service resource scheduler automatically selects an appropriate number of GPU cards. The CPU and memory are configured to prevent the reasoning service from misusing memory, since CPU misuse would cause host node abnormality. The reasoning service platform performs scheduling according to the cluster resources and the reasoning service resource specifications. The scheduling process may include: first, screening, by the reasoning service scheduler and according to the configured CPU and memory parameters, the nodes in the cluster that meet the CPU and memory requirements; then, obtaining the usable resource specifications according to the reasoning service name and selecting the resource specification that can provide the maximum throughput for resource scheduling, that is, automatically selecting the appropriate resource specification based on the maximum throughput of the language model service on nodes of different specifications. Based on this strategy, the number of expansions of the reasoning service can be reduced and more service accesses can be directed to the same reasoning service copy; in the limit case, RadixAttention (a method for optimizing language model reasoning performance) can reduce redundant computation by reusing the key-value cache, thereby improving reasoning service performance.
When the free resources of the cluster cannot satisfy the selected resource specification, a resource specification providing lower throughput is selected for scheduling. For example, for LLM-mode-2 in Table 1, when the throughput on 8 H100 cards exceeds that on 8 A100 cards, which in turn exceeds that on 4 A100 cards, the resource specification selection sequence is 8 H100, 8 A100, 4 A100; that is, the corresponding nodes are selected in descending order of throughput. When no node can be scheduled even with the resource specification of minimum throughput, the task scheduling fails, and the task must wait for other tasks to release resources before being scheduled again.
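The throughput-ordered selection with fallback can be sketched as follows; the tuple shape and specification names are illustrative, not the invention's interface:

```python
def select_spec(specs, free_nodes):
    """Pick the deployable resource specification with the highest measured
    throughput, falling back to lower-throughput specs when the cluster's
    free resources cannot satisfy it.

    specs: list of (name, gpu_count, throughput) from performance testing;
    free_nodes: dict mapping spec name -> number of idle nodes able to host it.
    """
    # Walk specifications in descending order of tested throughput.
    for name, gpus, tput in sorted(specs, key=lambda s: s[2], reverse=True):
        if free_nodes.get(name, 0) > 0:
            return name
    return None  # scheduling fails; the task must wait for released resources
```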
Further, in the method for managing the inference service provided by the embodiment of the invention, after the language model inference service is deployed, the method can further comprise the steps of recording the address of the inference service copy and configuring the address of the inference service copy in a load balancer.
Accordingly, forwarding the language model dialog to the corresponding inference service replica using the load balancer may specifically include forwarding the language model dialog to the corresponding inference service replica using the load balancer based on the address of the configured inference service replica.
In implementations, when the deployment of the inference service is complete, the IP address and the maximum throughput of the inference service copy can be recorded, and the copy IP address is configured in the load balancer built into the platform; the built-in load balancing service is automatically configured for forwarding, so that external requests are forwarded to the node where the inference service is located and the inference service is provided externally.
Further, in the above-mentioned reasoning service management method provided by the embodiment of the present invention, step S102 may specifically include: calculating, by using the reasoning service scaling component, the sum of the throughput of each language model session of each user on the reasoning service copy within the set duration, and calculating, from the throughput sum and the set duration, the average throughput of the sessions within the set duration.
In implementation, since each inference service provides multi-round dialogue services for different users, throughput is generally computed per dialogue per user, yielding the token processing speed of each dialogue in tokens/s; when computing the throughput of a copy, the throughput of every dialogue must be aggregated. The invention can count the sum of the throughput of accesses to the inference service copy within a set time period (for example, 1 minute), calculate the average throughput over that period, and trigger inference service capacity expansion when the average reaches a set threshold. Specifically, capacity expansion can be triggered when the real-time throughput reaches 90% of the maximum throughput of the inference service copy, and the copy is reduced when no request arrives within a certain period.
Taking the deployment of LLM-mode-2 on 8 H100 cards in Table 1 as an example, the average throughput of each session of each user over a set period (e.g., 1 minute) is calculated, and when it exceeds 90% of the maximum throughput of the copy, the capacity expansion operation is triggered. The 90% value is chosen because the lengths of user requests and model responses differ, and the number of tokens generated in the Prompt and Generate phases is random, so the amount of GPU resources occupied by a single request is uncertain; 10% of free space is therefore reserved. The specific formula is as follows:
T_rt = (Σ_i Σ_j T_ij) / t

wherein T_ij denotes the throughput of the j-th session of user i and t denotes the set duration; t may be set to 60 s. Capacity expansion is triggered when T_rt exceeds 90% of the maximum throughput T_max of the copy.
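The expansion trigger can be sketched with the same windowed average; the function name and the `headroom` parameter (encoding the 10% reserve discussed above) are illustrative:

```python
def should_scale_out(session_throughputs, t, t_max, headroom=0.9):
    """Trigger capacity expansion when the windowed average throughput exceeds
    headroom * T_max, leaving slack for prompt/generate variance."""
    # T_rt = (sum over users i and sessions j of T_ij) / t
    t_rt = sum(sum(per_user) for per_user in session_throughputs) / t
    return t_rt > headroom * t_max
```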
It should be added that the present invention is also applicable to reasoning services running across GPUs and other types of acceleration cards, such as neural processing units (Neural Processing Unit, NPU); that is, some copies run on GPU nodes and some on other acceleration card nodes, which will not be described in detail here.
Based on the above embodiments, the present invention further provides corresponding embodiments of a reasoning service management apparatus and a reasoning service management device. It should be noted that the present invention describes the apparatus embodiments from two angles: one based on functional modules and the other based on hardware.
Fig. 4 is a schematic structural diagram of an inference service management apparatus according to an embodiment of the present invention. The embodiment is based on the angle of the functional module, and the device comprises:
the copy forwarding module 10 is configured to receive the language model dialogue and forward the language model dialogue to a corresponding reasoning service copy;
The data statistics module 11 is used for counting throughput data of the reasoning service copies in a set period when the language model reasoning service runs;
The copy capacity expansion module 12 is configured to perform reasoning service capacity expansion to obtain a capacity expansion copy when the counted throughput data exceeds a set throughput threshold;
An address configuration module 13, configured to configure the address of the expanded copy.
In the reasoning service management device provided by the embodiment of the invention, through the interaction of the above four modules, the language model dialogue can be forwarded to the corresponding reasoning service copy; the throughput data of the reasoning service copy within a set period are counted; when the data exceed a set throughput threshold, the reasoning service is expanded to obtain an expanded copy; and the address of the expanded copy is configured. In this way, the inference service expansion mechanism and load balancing policy configuration in the language model service scenario are optimized: throughput of the language model service serves as the basis for service expansion and load balancing selection, and expansion of the language model inference service is triggered when the throughput exceeds the threshold, thereby optimizing the allocation and use of resources, improving resource utilization, and improving user experience.
Since the embodiments of the apparatus portion correspond to those of the method portion, reference is made to the description of the method embodiments, which is not repeated here; the apparatus has the same advantageous effects as the above reasoning service management method.
Fig. 5 is a schematic structural diagram of an inference service management apparatus according to an embodiment of the present invention. The present embodiment is based on a hardware point of view, and as shown in fig. 5, the inference service management apparatus includes:
a memory 20 for storing a computer program;
a processor 21 for implementing the steps of the inference service management method as mentioned in the above embodiments when executing a computer program.
Processor 21 may include one or more processing cores, such as a 4-core or 8-core processor. Processor 21 may be implemented in at least one hardware form of a digital signal processor (Digital Signal Processor, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA), or a programmable logic array (Programmable Logic Array, PLA). Processor 21 may also comprise a main processor and a coprocessor: the main processor, also called CPU, is a processor for processing data in the wake-up state; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, processor 21 may be integrated with a graphics processing unit (Graphics Processing Unit, GPU) responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, processor 21 may also include an artificial intelligence (Artificial Intelligence, AI) processor for handling computing operations related to machine learning.
The memory 20 may include one or more non-volatile storage media, which may be non-transitory. Memory 20 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 20 is at least used for storing a computer program 201, which, when loaded and executed by the processor 21, is capable of implementing the relevant steps of the reasoning service management method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may further include an operating system 202, data 203, and the like, where the storage manner may be transient storage or permanent storage. Operating system 202 may include Windows, unix, linux, among other things. The data 203 may include, but is not limited to, data related to the above-mentioned inference service management method, and the like.
In some embodiments, the inference service management device may further include a display 22, an input-output interface 23, a communication interface 24, a power supply 25, and a communication bus 26. Those skilled in the art will appreciate that the structure shown in fig. 5 does not constitute a limitation of the inference service management apparatus, which may include more or fewer components than illustrated. The inference service management device provided by the embodiment of the invention comprises the memory and the processor; the processor, when executing the program stored in the memory, can implement the above reasoning service management method, with the same effects as described above.
Finally, the invention also provides a corresponding embodiment of the nonvolatile storage medium. The nonvolatile storage medium has stored thereon a computer program which, when executed by a processor, performs the steps described in the method embodiments described above.
It will be appreciated that the methods of the above embodiments, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored on a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, in whole or in part, may be embodied in the form of a software product stored in a storage medium for performing all or part of the steps of the methods according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program code. The nonvolatile storage medium provided by the invention can implement the above reasoning service management method, with the same effects as described above.
Finally, the invention also provides a corresponding embodiment of the computer program product. The computer program product comprises computer programs/instructions which when executed by the processor implement the steps as described in the embodiments of the reasoning service management method described above. The computer program product provided by the invention can realize the above-mentioned reasoning service management method, and the effects are the same as the above.
It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The inference service management method, the device, the medium and the computer program product provided by the invention are described in detail above. In the description, each embodiment is described in a progressive manner, and each embodiment is mainly described by the differences from other embodiments, so that the same similar parts among the embodiments are mutually referred. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section. It should be noted that it will be apparent to those skilled in the art that the present invention may be modified and practiced without departing from the spirit of the present invention.
Claims (15)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510025516.XA CN119415273A (en) | 2025-01-08 | 2025-01-08 | Reasoning service management method, device, medium and computer program product |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119415273A true CN119415273A (en) | 2025-02-11 |
Family
ID=94462370
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119718878A (en) * | 2025-02-28 | 2025-03-28 | 苏州元脑智能科技有限公司 | Method, device and equipment for acquiring inference performance data of large model inference cluster |
| CN120358236A (en) * | 2025-06-24 | 2025-07-22 | 苏州元脑智能科技有限公司 | Load balancing method and device for reasoning service, electronic equipment and storage medium |
| CN120353516A (en) * | 2025-06-25 | 2025-07-22 | 苏州元脑智能科技有限公司 | Service starting and controlling method, electronic device, storage medium and program product |
| CN120743558A (en) * | 2025-08-29 | 2025-10-03 | 苏州元脑智能科技有限公司 | Inference service resource allocation method, electronic device and readable storage medium |
| CN121148746A (en) * | 2025-11-17 | 2025-12-16 | 复旦大学附属妇产科医院 | A Dynamic Inference Load Balancing Service Method for Doctor-Patient Dialogue Model |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114510322A (en) * | 2022-02-16 | 2022-05-17 | 平安国际智慧城市科技股份有限公司 | Pressure measurement control method and device of service cluster, computer equipment and medium |
| CN114745278A (en) * | 2022-04-11 | 2022-07-12 | 中和农信项目管理有限公司 | Method and device for expanding and contracting capacity of business system, electronic equipment and storage medium |
| KR20240152764A (en) * | 2023-04-13 | 2024-10-22 | 주식회사 케이티 | Method for scheduling gpu resource in ai service provision platform and apparatus thereof |
| CN118819841A (en) * | 2024-07-02 | 2024-10-22 | 摩尔线程智能科技(北京)有限责任公司 | Load balancing method, device and electronic equipment |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20250211 |