
CN119415273A - Reasoning service management method, device, medium and computer program product - Google Patents


Info

Publication number
CN119415273A
CN119415273A (application CN202510025516.XA)
Authority
CN
China
Prior art keywords
service
language model
inference service
inference
reasoning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510025516.XA
Other languages
Chinese (zh)
Inventor
王德奎
陈培
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Beijing Electronic Information Industry Co Ltd
Original Assignee
Inspur Beijing Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Beijing Electronic Information Industry Co Ltd filed Critical Inspur Beijing Electronic Information Industry Co Ltd
Priority to CN202510025516.XA
Publication of CN119415273A
Legal status: Pending

Links

Classifications

    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 — Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/505 — Allocation of resources to service a request, the resource being a machine, considering the load
    • G06F 9/5083 — Techniques for rebalancing the load in a distributed system
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 — Computing arrangements using knowledge-based models
    • G06N 5/04 — Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Multi Processors (AREA)

Abstract

The invention relates to the field of computer technology and discloses an inference service management method, device, medium and computer program product. The method comprises: receiving a language model dialogue and forwarding it to the corresponding inference service replica using a load balancer; and, when throughput data exceeds a set throughput threshold, sending a scale-up notification signal via the inference service scaling component to the inference service resource scheduler, so that the resource scheduler scales the inference service up to obtain scaled-up replicas, whose addresses the load balancer then configures. In this way, throughput of the language model service serves as the basis for both scaling decisions and load-balancing selection: scale-up of the language model inference service is triggered when throughput exceeds the threshold, so that resources are optimally allocated and used, resource utilization improves, and user experience improves.

Description

Inference service management method, device, medium and computer program product
Technical Field
The present invention relates to the field of computer technology, and in particular, to a method, apparatus, medium, and computer program product for managing inference services.
Background
The inference service of a language model requires substantial graphics processing unit (GPU) resources, and a single host can typically deploy only one service, which cannot satisfy the dialogue requests of large numbers of users in high-concurrency internet scenarios. Meanwhile, a data center generally accumulates GPU servers of different specifications, and on the principle of asset reuse the language model service needs to run on GPU servers of different specifications. In related technical schemes, multiple service replicas of the same application share the same processor and memory specification and carry equal weight during elastic scaling; load-balancing strategies such as round-robin and least-connections forward requests on the assumption that different replicas provide the same load capacity, which does not hold for language model services; and elastic scaling strategies are mostly based on simple resource-utilization metrics that ignore the performance indicators specific to language model inference services, so scale-up of the language model inference service cannot be achieved.
Disclosure of Invention
The object of the present invention is to provide an inference service management method, device, medium and computer program product that suit language model service scenarios and can scale the language model inference service, so as to optimally allocate and use resources, improve resource utilization, and improve user experience.
In order to solve the above technical problems, the present invention provides an inference service management method for an inference service platform comprising a load balancer, an inference service scaling component and an inference service resource scheduler, the method comprising the following steps:
receiving a language model dialogue, and forwarding the language model dialogue to the corresponding inference service replica using the load balancer;
while the language model inference service is running, counting throughput data of the inference service replicas over a set period using the inference service scaling component;
when the counted throughput data exceeds a set throughput threshold, sending a scale-up notification signal via the inference service scaling component to the inference service resource scheduler, so that the resource scheduler scales the inference service up and obtains a scaled-up replica; and
configuring the address of the scaled-up replica using the load balancer.
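The four steps above can be sketched in a few lines of Python. This is a hypothetical illustration, not the patent's implementation: the class names (`LoadBalancer`, `ScalingComponent`, `ResourceScheduler`), the threshold value and the replica addresses are all made up for the example.

```python
# Illustrative sketch of the claimed method; all names and values are assumptions.

class LoadBalancer:
    def __init__(self):
        self.replica_addresses = []

    def add_replica(self, address):
        # Final step: configure the address of a (newly scaled-up) replica.
        self.replica_addresses.append(address)

    def forward(self, dialogue):
        # Step 1: forward the dialogue to a replica (simplest choice: the first).
        return self.replica_addresses[0], dialogue


class ScalingComponent:
    def __init__(self, threshold_tokens_per_s):
        self.threshold = threshold_tokens_per_s

    def needs_scale_up(self, throughput_tokens_per_s):
        # Step 3: signal scale-up when counted throughput exceeds the threshold.
        return throughput_tokens_per_s > self.threshold


class ResourceScheduler:
    def scale_up(self):
        # Returns the address of a new replica (placeholder value).
        return "replica-new:8000"


lb = LoadBalancer()
lb.add_replica("replica-0:8000")
scaler = ScalingComponent(threshold_tokens_per_s=100.0)
scheduler = ResourceScheduler()

measured = 120.0  # step 2: throughput counted over the set period
if scaler.needs_scale_up(measured):
    lb.add_replica(scheduler.scale_up())
```

After the run, the load balancer holds both the original and the scaled-up replica address, mirroring the claim's final configuration step.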
In a first aspect of the above inference service management method, receiving a language model dialogue and forwarding it to the corresponding inference service replica using the load balancer comprises:
receiving a first language model dialogue carrying a target dialogue identifier;
determining the corresponding inference service replica according to the target dialogue identifier; and
forwarding the first language model dialogue to the determined inference service replica using the load balancer.
In another aspect, before receiving the first language model dialogue carrying the target dialogue identifier, the method further comprises:
establishing an association between dialogue identifiers and inference service replicas;
and determining the corresponding inference service replica according to the target dialogue identifier comprises:
determining, from the established association between dialogue identifiers and inference service replicas and in combination with the target dialogue identifier, the inference service replica associated with the target dialogue identifier.
In another aspect, establishing the association between a dialogue identifier and an inference service replica comprises:
randomly generating a corresponding dialogue identifier when a language model dialogue is created for the first time;
querying real-time throughput data of the inference service replicas using the load balancer, and selecting the replica with the minimum throughput; and
recording the relationship between the randomly generated dialogue identifier and the address of the selected replica, thereby establishing the association between the dialogue identifier and the inference service replica.
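The association step can be sketched as follows, assuming a plain dict as the session table and UUIDs as the randomly generated identifiers (both are illustrative choices, not specified by the patent):

```python
# Illustrative: bind a newly created dialogue ID to the replica with the
# lowest current throughput, so later turns reuse the same replica.
import uuid


def create_dialogue(replica_throughput, session_table):
    """replica_throughput: {address: current tokens/s};
    session_table: {dialogue_id: replica address}."""
    dialogue_id = str(uuid.uuid4())  # randomly generated dialogue identifier
    # Select the replica with the minimum real-time throughput.
    address = min(replica_throughput, key=replica_throughput.get)
    session_table[dialogue_id] = address  # record the association
    return dialogue_id, address


table = {}
did, addr = create_dialogue({"node-a:8000": 80.0, "node-b:8000": 20.0}, table)
```

Here `node-b:8000` is chosen because its current throughput (20 tokens/s) is the lowest, and the new dialogue identifier is bound to it in the table.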
In another aspect, the inference service management method further comprises:
if the throughput of the determined inference service replica has reached its maximum, or the replica no longer exists, forwarding the first language model dialogue to another inference service replica using the load balancer and updating the association between the dialogue identifier and the inference service replica.
In another aspect, receiving a language model dialogue and forwarding it to the corresponding inference service replica using the load balancer comprises:
receiving a second language model dialogue that does not carry a dialogue identifier;
selecting, using the load balancer, the inference service replica with the smallest ratio of real-time throughput to maximum throughput; and
forwarding the second language model dialogue to the selected inference service replica.
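The least-ratio selection rule can be sketched in one function. The addresses and throughput figures below are invented for the example; the point is that the replica with the most relative headroom wins, even if its absolute load is higher:

```python
# Illustrative: for dialogues without an identifier, pick the replica whose
# real-time/maximum throughput ratio is smallest (most relative headroom).
def select_replica(replicas):
    """replicas: {address: (realtime_tokens_per_s, max_tokens_per_s)}."""
    return min(replicas, key=lambda a: replicas[a][0] / replicas[a][1])


choice = select_replica({
    "a100-node:8000": (300.0, 1000.0),  # ratio 0.30
    "v100-node:8000": (100.0, 200.0),   # ratio 0.50
})
```

Note that a plain least-throughput rule would pick `v100-node:8000` (100 < 300); the ratio rule instead picks the replica that is proportionally least loaded, which is what makes it suitable for replicas with different maximum capacities.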
In another aspect, the inference service management method further comprises:
when the counted throughput data is zero, sending a scale-down notification signal via the inference service scaling component to the inference service resource scheduler, so that the resource scheduler scales the inference service down and obtains a scaled-down replica; and
shifting the address of the scaled-down replica out of the load balancer.
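The scale-down path is symmetric to scale-up and can be sketched as below; the function name and data shapes are illustrative assumptions:

```python
# Illustrative: when a replica's counted throughput over the period is zero,
# scale it down and shift its address out of the load balancer's list.
def scale_down_idle(replica_throughput, balancer_addresses):
    """replica_throughput: {address: tokens/s counted over the set period}."""
    idle = [a for a, t in replica_throughput.items() if t == 0]
    for address in idle:
        balancer_addresses.remove(address)  # shift the scaled-down copy out
    return idle


addresses = ["node-a:8000", "node-b:8000"]
removed = scale_down_idle({"node-a:8000": 50.0, "node-b:8000": 0.0}, addresses)
```

Only the idle replica (`node-b:8000`, zero throughput over the window) is removed; the active one keeps serving traffic.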
In another aspect, before receiving the language model dialogue, the method further comprises:
when nodes of multiple specifications exist in the cluster, running a throughput performance test of the language model service and recording its maximum throughput on nodes of each specification; and
defining, from the recorded maximum throughput data, a different set throughput threshold for each inference service replica of the same language model service.
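One simple way to derive per-replica thresholds from the benchmark data is a fixed fraction of each specification's measured maximum. The fraction, the specification names and the throughput numbers below are assumptions for illustration; the patent only requires that different replicas get different thresholds:

```python
# Illustrative: derive a per-specification scale-up threshold from measured
# maximum throughput on each node specification (all values are invented).
def thresholds_from_benchmarks(max_throughput_by_spec, fraction=0.8):
    # Trigger scale-up at a fixed fraction of each spec's measured maximum.
    return {spec: m * fraction for spec, m in max_throughput_by_spec.items()}


limits = thresholds_from_benchmarks({"8xA100": 1000.0, "4xV100": 250.0})
```

A replica on the stronger node specification thus gets a proportionally higher threshold, reflecting its larger load capacity.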
In another aspect, before receiving the language model dialogue, the method further comprises:
selecting, using the inference service resource scheduler and according to the cluster resource usage, a resource specification that satisfies a set condition on which to deploy the language model inference service.
In another aspect, selecting a resource specification that satisfies a set condition to deploy the language model inference service comprises:
screening, using the inference service resource scheduler and according to the configured processor and memory parameters, the cluster nodes that satisfy the processor and memory requirements;
obtaining a resource specification list from the inference service name, and selecting from the list the resource specification that provides the maximum throughput;
judging whether the screened nodes satisfy the selected resource specification;
if they do, selecting an idle node among the screened nodes to deploy the language model inference service; and
if they do not, selecting a node according to the throughput ordering to deploy the language model inference service.
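The scheduler's decision above can be sketched as a filter-then-rank function. The node records, field names and fallback rule (highest-throughput eligible node) are illustrative assumptions about how the ordering might be applied:

```python
# Illustrative sketch of the deployment decision: filter nodes by CPU/memory,
# prefer an idle node of the highest-throughput specification, and otherwise
# fall back to the best node in descending throughput order.
def pick_node(nodes, cpu_req, mem_req):
    """nodes: list of dicts with keys name, cpu, mem, spec, max_throughput, idle."""
    eligible = [n for n in nodes if n["cpu"] >= cpu_req and n["mem"] >= mem_req]
    best_spec = max(eligible, key=lambda n: n["max_throughput"])["spec"]
    idle_best = [n for n in eligible if n["spec"] == best_spec and n["idle"]]
    if idle_best:
        return idle_best[0]
    # Fallback: rank eligible nodes by throughput and take the best.
    return max(eligible, key=lambda n: n["max_throughput"])


nodes = [
    {"name": "n1", "cpu": 16, "mem": 128, "spec": "4xV100",
     "max_throughput": 250.0, "idle": True},
    {"name": "n2", "cpu": 32, "mem": 256, "spec": "8xA100",
     "max_throughput": 1000.0, "idle": False},
]
chosen = pick_node(nodes, cpu_req=16, mem_req=128)
```

In this example no idle node of the best specification exists, so the fallback ordering selects `n2` despite it not being idle.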
In another aspect, after the language model inference service is deployed, the method further comprises:
recording the address of the inference service replica, and configuring that address in the load balancer;
and forwarding the language model dialogue to the corresponding inference service replica using the load balancer comprises:
forwarding the language model dialogue to the corresponding replica using the load balancer, according to the configured replica address.
In another aspect, counting throughput data of the inference service replicas over a set period using the inference service scaling component comprises:
calculating, using the inference service scaling component, the sum of the throughput of every language model dialogue of every user within the set duration on an inference service replica; and
calculating, from that throughput sum and the set duration, the average throughput of the language model dialogues within the set duration.
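The statistic reduces to a sum over the window divided by the window length. A minimal sketch, with invented per-dialogue token counts and a 30-second window:

```python
# Illustrative: sum tokens produced by each dialogue over the set duration,
# then divide by the duration to get the average throughput (tokens/s).
def average_throughput(tokens_per_dialogue, duration_s):
    total_tokens = sum(tokens_per_dialogue)  # throughput sum over the window
    return total_tokens / duration_s         # mean tokens per second


avg = average_throughput([600, 900, 1500], duration_s=30)
```

With 3000 tokens generated across three dialogues in 30 seconds, the replica's average throughput is 100 tokens/s, which is the value compared against the set threshold.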
In order to solve the above technical problem, the present invention also provides an inference service management apparatus, the apparatus comprising:
a memory for storing a computer program; and
a processor for implementing the steps of the above inference service management method when executing the computer program.
In order to solve the above technical problem, the present invention further provides a non-volatile storage medium storing a computer program which, when executed by a processor, implements the steps of the above inference service management method.
In order to solve the above technical problem, the present invention also provides a computer program product comprising a computer program/instructions which, when executed by a processor, implements the steps of the above inference service management method.
The method comprises: receiving a language model dialogue and forwarding it to the corresponding inference service replica using the load balancer; while the language model inference service runs, counting throughput data of the replicas over a set period using the inference service scaling component; when the counted throughput data exceeds a set throughput threshold, sending a scale-up notification signal via the scaling component to the inference service resource scheduler, so that the scheduler scales the inference service up and obtains a scaled-up replica; and configuring the address of the scaled-up replica using the load balancer.
The beneficial effect of the invention is as follows: the inference service management method forwards the language model dialogue to the corresponding inference service replica using the load balancer and counts throughput data of the replicas over a set period using the inference service scaling component; when the data exceeds the set throughput threshold, a scale-up notification signal is sent via the scaling component to the inference service resource scheduler, which scales the inference service up to obtain a scaled-up replica whose address the load balancer then configures. The method suits language model service scenarios and optimizes both the scaling mechanism and the load-balancing strategy for them: throughput of the language model service is the basis for scaling decisions and load-balancing selection, and scale-up of the language model inference service is triggered when throughput exceeds the threshold, so that resources are optimally allocated and used, resource utilization improves, and user experience improves.
In addition, for the above inference service management method, the invention also provides a corresponding inference service management apparatus, a non-volatile storage medium and a computer program product, which have the same or corresponding technical features and effects as the method.
Drawings
For a clearer description of the embodiments of the present invention, the drawings required by the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present invention; other drawings may be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of an inference service management method provided in an embodiment of the present invention;
fig. 2 is a schematic diagram of a framework corresponding to an inference service management method according to an embodiment of the present invention;
fig. 3 is a signaling interaction diagram corresponding to an inference service management method provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an inference service management apparatus according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an inference service management apparatus according to an embodiment of the present invention.
Detailed Description
With the development of artificial intelligence technology, language models play an increasingly important role in natural language processing and can handle complex language tasks such as text generation, translation and dialogue. The inference service of a language model usually requires substantial graphics processing unit (GPU) resources, and a single host can typically deploy only one language model service, which is insufficient to meet the dialogue requests of large numbers of users in high-concurrency internet scenarios. Meanwhile, a data center usually accumulates GPU servers of different specifications, and on the principle of asset reuse the language model service needs to run on GPU servers of different specifications. How to manage the resources required by the inference service effectively, scale the language model inference service in real time, and define a load-balancing strategy suited to it has therefore become a key problem for improving resource utilization and user experience.
In related technical schemes, the central processing unit (CPU) and memory are generally used as the resource specification. Even when they run on servers of different specifications, multiple replicas of the same application share the same CPU and memory specification and are assumed by default to provide the same maximum load capacity, so they carry equal weight during elastic scaling and under load-balancing policy management. For language model services, however, a replica's performance depends strongly on the GPU computing power and video memory of its node, i.e. on the GPU model and count of the node running the service. When different replicas of the inference service run on nodes of different types, they generally have different throughput ceilings; the hardware GPU resources determine each replica's maximum load capacity. Related load-balancing policies such as round-robin and least-connections forward requests on the assumption that different replicas provide the same load capacity, and are therefore not applicable to the language model service scenario.
In addition, with the development of containerization technology, service replicas are automatically added and removed by an elastic scaling mechanism, i.e. the number of service instances is dynamically adjusted to the actual load to adapt to changing business demand. However, most elastic policies are based on simple resource-usage metrics such as CPU and memory usage or queries per second (QPS). These generally do not consider the performance indicators specific to language model inference services, such as throughput (tokens/s) or the different load capacities of language model services running on nodes of different specifications; plain CPU, memory and QPS metrics cannot accurately represent the state of the inference service.
In addition, language model services typically use key-value caching (KV cache) technology to reuse the key-value pair information of tokens already generated and so speed up inference. This requires that multiple turns of a dialogue on the same topic use the same replica of the language model service, maximizing KV-cache reuse. Related load-balancing policies provide some session-keeping capability, such as policies based on source IP or session affinity, but in a language model service scenario one client typically holds dialogues on multiple topics; source IP and session cannot distinguish different topic dialogues and suffer from session expiry, so they are not applicable to the language model service scenario.
To solve the above problems, the present invention provides an inference service management method that can be applied to language model inference service scenarios and optimizes the configuration of the scaling strategy and the load-balancing strategy for them.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present invention.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. Fig. 1 is a flowchart of an inference service management method provided in an embodiment of the present invention, and as shown in fig. 1, the method is used for an inference service platform including a load balancer, an inference service scaling component and an inference service resource scheduler, and includes:
S101, receiving the language model dialogue, and forwarding the language model dialogue to a corresponding reasoning service copy by using a load balancer.
A language model is a model that learns from large amounts of text to grasp linguistic features such as the structure of a language, word usage and the rules for composing sentences. In the present invention, the language model may be a large language model (LLM) or another type of language model; no limitation is imposed here. The language model inference service is an online inference service deployed from a language model and supports multi-replica deployment. It can be offered to internet users in the form of multi-turn dialogues: each turn is generated by the language model service as a long sequence, i.e. a streamed output of many tokens, which the inference service delivers to the end user character by character or word by word.
Fig. 2 is a schematic diagram of a framework corresponding to an inference service management method according to an embodiment of the present invention. In implementation, as shown in Fig. 2, a user's dialogue request may be passed through a proxy service (e.g. an Nginx proxy) to a web service, which passes it to the load balancer. A load balancer is a device that distributes network traffic or workload among computing resources. When step S101 is executed, the load balancer receives the language model dialogue and forwards it to the corresponding inference service replica.
S102, while the language model inference service is running, counting throughput data of the inference service replicas over a set period using the inference service scaling component.
It should be noted that the inference service scaling component is a tool for dynamically adjusting inference service resources. The scaling component of the invention counts throughput data of the inference service replicas (i.e. the number of tokens generated per second) over a set period, enabling dynamic adjustment of computing resources. The set period may be chosen according to the actual situation and is not limited here. Throughput is a fine-grained performance indicator of the language model service, and the invention uses it as the basis for the subsequent scale-up of the service.
When step S102 is executed, while the language model inference service is running, the inference service scaling component can be used to count throughput data for the multiple replicas of the inference service over the set period.
S103, when the counted throughput data exceeds the set throughput threshold, sending a scale-up notification signal via the inference service scaling component to the inference service resource scheduler, so that the resource scheduler scales the inference service up and obtains a scaled-up replica.
It should be noted that the inference service resource scheduler manages and allocates the resources required by the inference service. It can scale the inference service up to obtain a scaled-up replica for further processing of inference tasks.
The set throughput threshold may be defined in advance, before step S103 is executed. Different set throughput thresholds are defined for different replicas of the same language model inference service, since different replicas provide differentiated load capacities.
When step S103 is executed, it is judged whether the throughput data counted by the scaling component exceeds the predefined set throughput threshold for that replica. If so, the processing rate of the inference service replica has reached its limit, and a scale-up notification signal is sent via the scaling component to the inference service resource scheduler, so that the scheduler scales the inference service up and obtains a scaled-up replica.
S104, configuring the address of the scaled-up replica using the load balancer.
As shown in Fig. 2, after the inference service resource scheduler obtains the scaled-up replica, the load balancer can be used to configure the replica's address; the address may belong to nodes of different specifications, e.g. a first node, a second node, a third node or a fourth node.
In the above inference service management method provided by the embodiment of the invention, the language model dialogue is forwarded to the corresponding inference service replica by the load balancer, and the inference service scaling component counts throughput data of the replicas over a set period; when the data exceeds the set throughput threshold, a scale-up notification signal is sent via the scaling component to the inference service resource scheduler, which scales the inference service up to obtain a scaled-up replica whose address the load balancer configures. In this way, the scaling mechanism and load-balancing configuration of the language model service scenario are optimized: throughput of the language model service is the basis for scaling and load-balancing decisions, and scale-up of the language model inference service is triggered when throughput exceeds the threshold, so that resource allocation and use are optimized, resource utilization improves, and user experience improves.
Further, in the above method, step S101, receiving a language model dialogue and forwarding it to the corresponding inference service replica using the load balancer, specifically comprises: receiving a first language model dialogue carrying a target dialogue identifier, determining the corresponding inference service replica according to the target dialogue identifier, and forwarding the first language model dialogue to the determined replica using the load balancer.
In implementation, the present invention introduces language model dialogue identifiers (IDs) into load-balancing policy selection, i.e. the dialogue identifier becomes a consideration in both service elasticity and the load-balancing policy. Different language model dialogues have different dialogue identifiers.
Fig. 3 is a signaling interaction diagram corresponding to the inference service management method provided in the embodiment of the present invention. As shown in Fig. 3, the corresponding inference service replica can be determined from the language model dialogue identifier, and the load balancer forwards the received dialogue to that replica, improving resource utilization and the running efficiency of the application.
Further, before receiving the first language model dialogue carrying the target dialogue identifier, the method may further include establishing an association between dialogue identifiers and inference service replicas. That is, a language model dialogue identifier is bound to a particular inference service replica so that the key-value cache information of the historical dialogue can be reused; the load balancer can store the association between dialogue identifiers and replicas.
Correspondingly, determining the corresponding inference service replica according to the target dialogue identifier may specifically comprise: determining, from the established association between dialogue identifiers and inference service replicas and in combination with the target dialogue identifier, the replica associated with the target dialogue identifier.
In implementation, after a first language model dialogue carrying a target dialogue identifier is received, the identifier is parsed, and the established binding between dialogue identifiers and replicas can be used to determine the replica bound to the target identifier.
Further, in implementation, establishing the association between a dialogue identifier and an inference service replica in the above steps may specifically comprise: randomly generating a corresponding dialogue identifier when a language model dialogue is created for the first time; querying real-time throughput data of the inference service replicas using the load balancer and selecting the replica with the minimum throughput; and recording the relationship between the randomly generated dialogue identifier and the address of the selected replica, thereby establishing the association between the dialogue identifier and the inference service replica.
In the implementation, when a corresponding language model dialogue is created for the first time, the association relation between a dialogue identifier and an inference service copy can be established. Specifically, a corresponding dialogue identifier is first randomly generated; at this moment, the load balancer has no inference service copy information associated with the dialogue identifier. The load balancer then queries the real-time throughput data of each inference service copy and selects the inference service copy with the minimum throughput. Finally, the load balancer records the association relation between the randomly generated dialogue identifier and the address of the selected inference service copy. When the dialogue is requested again, it can be forwarded preferentially to the inference service copy where the historical dialogue information exists (i.e., that copy is preferentially used), enabling an increase in reasoning speed.
Further, in a specific implementation, the method for managing the inference service provided by the embodiment of the invention may further include forwarding the first language model dialogue to other inference service copies by using a load balancer and updating the association relationship between the dialogue identifier and the inference service copy if the throughput of the determined inference service copy has reached a maximum value or the inference service copy does not exist.
In implementations, when the throughput of the inference service replica has reached a maximum, or the inference service replica itself does not exist, the load balancer can be utilized to forward the first language model dialog to other inference service replicas and update the association of the dialog identifier with the inference service replica.
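The binding, lookup, and fallback behaviour described above can be sketched roughly as follows. This is an illustrative simplification, not the platform's actual implementation; the class name, the in-memory binding table, and the replica bookkeeping are all assumptions:

```python
import uuid

class SessionAffinityBalancer:
    """Sketch of the dialogue-to-replica binding described above.

    replicas maps a replica address to its current and maximum throughput;
    bindings remembers which replica holds a dialogue's key-value cache.
    """

    def __init__(self, replicas):
        self.replicas = replicas   # addr -> {"throughput", "max_throughput"}
        self.bindings = {}         # dialogue id -> replica address

    def _least_loaded(self):
        # replica with the smallest real-time throughput
        return min(self.replicas, key=lambda a: self.replicas[a]["throughput"])

    def route(self, dialogue_id=None):
        if dialogue_id is None:
            # first creation of a dialogue: randomly generate an identifier
            # and bind it to the least-loaded replica
            dialogue_id = uuid.uuid4().hex
            self.bindings[dialogue_id] = self._least_loaded()
        addr = self.bindings.get(dialogue_id)
        replica = self.replicas.get(addr)
        if replica is None or replica["throughput"] >= replica["max_throughput"]:
            # bound replica is gone or saturated: fall back and rebind
            addr = self._least_loaded()
            self.bindings[dialogue_id] = addr
        return dialogue_id, addr
```

Subsequent requests carrying the same dialogue identifier stick to the bound replica, so its local key-value cache can be reused until the replica saturates or disappears.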
In addition, in the above-mentioned reasoning service management method provided by the embodiment of the present invention, step S101 receives a language model dialogue, and forwards the language model dialogue to a corresponding reasoning service copy by using a load balancer, and specifically may further include receiving a second language model dialogue that does not carry a dialogue identifier, selecting a reasoning service copy with a minimum ratio between real-time throughput and maximum throughput by using the load balancer, and forwarding the second language model dialogue to the selected reasoning service copy.
In an implementation, as shown in fig. 3, after receiving a second language model dialogue that does not carry a dialogue identifier, the load balancer may be utilized to select the copy of the inference service with the minimum ratio between real-time throughput and maximum throughput, and forward the second language model dialogue to the selected copy. That is, when multiple inference service copies of the language model service exist in the cluster and a received language model dialogue carries no dialogue identifier, the dialogue is preferentially forwarded to the idle inference service copy with the larger spare throughput.
The present invention can calculate the real-time throughput, and the ratio between the real-time throughput and the maximum throughput, using the following formulas:

$$T_{rt} = \frac{\sum_{i}\sum_{j} T_{ij}}{t}, \qquad R = \frac{T_{rt}}{T_{\max}}$$

where $T_{ij}$ denotes the throughput of the j-th session of user i, $t$ is the set duration, $T_{rt}$ represents the real-time throughput, $T_{\max}$ represents the maximum throughput of the language model on nodes of different specifications, and $R$ represents the ratio between the real-time throughput and the maximum throughput.
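A minimal sketch of this calculation follows; the function and parameter names are assumptions for illustration. The per-window readings $T_{ij}$, summed over all users and sessions, are averaged over the window of length $t$ and then compared against the node's maximum throughput:

```python
def throughput_ratio(t_ij_samples, t, t_max):
    """t_ij_samples: throughput readings summed over all users i and
    sessions j during the window; t: set duration in seconds;
    t_max: maximum throughput of the replica's node specification."""
    t_rt = sum(t_ij_samples) / t   # real-time throughput T_rt
    return t_rt, t_rt / t_max      # ratio R = T_rt / T_max
```

A load balancer handling a dialogue without an identifier would then pick the replica whose returned ratio is smallest.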
Further, in a specific implementation, the method for managing inference service provided by the embodiment of the invention further comprises the steps of sending a contraction notification signal to the inference service resource scheduler through the inference service expansion component when the counted throughput data is zero, enabling the inference service resource scheduler to conduct inference service contraction to obtain a contraction copy, and moving out an address of the contraction copy by using the load balancer.
In implementation, as shown in fig. 3, when the throughput data counted over the set period is zero, the capacity-reduction notification signal can be sent to the reasoning service resource scheduler through the reasoning service expansion component, so that the reasoning service resource scheduler scales down the reasoning service to obtain a scaled-down copy, thereby ensuring efficient operation of the reasoning service under different load conditions.
It should be noted that a capacity-reduction mechanism generally performs the scale-down operation when traffic falls below a certain threshold. However, the language model needs a local key-value cache at run time to optimize reasoning performance, and the same dialogue obtains the best performance only when it keeps using the same reasoning service copy; the scale-down of the language model reasoning service therefore needs to be triggered only when no traffic is generated. The invention can set the period to 30 minutes: when no request is received within 30 minutes, the reasoning service copies are scaled down, but the reasoning service must keep at least one copy.
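The idle-window scale-down rule can be sketched as follows, under the stated 30-minute assumption; the helper name and the timestamp bookkeeping are hypothetical:

```python
IDLE_WINDOW_S = 30 * 60   # scale down after 30 minutes without a request
MIN_REPLICAS = 1          # at least one inference service copy must remain

def replicas_to_remove(last_request_ts, now):
    """Return replica addresses that were idle for the whole window,
    never shrinking the service below MIN_REPLICAS."""
    idle = [addr for addr, ts in last_request_ts.items()
            if now - ts >= IDLE_WINDOW_S]
    active = len(last_request_ts) - len(idle)
    keep = max(MIN_REPLICAS - active, 0)   # idle copies we must retain
    return idle[:len(idle) - keep]
```

When every copy is idle, one copy is retained so the service can still answer the next request without a cold start.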
Further, in a specific implementation, before receiving the language model dialogue, the reasoning service management method provided by the embodiment of the invention may further include performing throughput performance test on the language model service when nodes with multiple specifications exist in the cluster, counting maximum throughput performance data of the language model service on nodes with different specifications, and defining different set throughput thresholds for different reasoning service copies of the same language model service according to the counted maximum throughput performance data.
In implementation, when GPU nodes with multiple specifications exist in the artificial intelligent cluster, the invention can test throughput performance aiming at language model service to be deployed, and count maximum throughput performance data of the language model service on the GPU nodes with different specifications, wherein different set throughput thresholds are defined for different reasoning service copies of the same language model service, so that GPU resources are optimally allocated and used.
It should be noted that, in the language model service scenario, based on the data center asset multiplexing principle, different service copies of the same language model service are allowed to run on GPU nodes with different specifications, and the maximum throughput of the language model service on the GPU nodes with different specifications is used as an elastic expansion and load balancing index, that is, when the same service is on nodes with different specifications, the expansion index and the request forwarding index of the same service are different.
Take as an example testing the throughput of the language model service to be deployed on different numbers of acceleration cards of different specifications, where each specification meets the minimum resource requirement of the reasoning task: for example, the throughput on 8 A100 cards is $T_{8 \times A100}$, the throughput on 8 H100 cards is $T_{8 \times H100}$, and the throughput on 4 A100 cards is $T_{4 \times A100}$.
The obtained performance test data is configured to an inference service platform, and the resource specifications used for running the language model inference service are defined, for example, the tested several specifications are defined, namely 8H 100, 8A 100 and 4A 100, and an inference service resource scheduler can select a proper resource specification to deploy the language model inference service according to the cluster resource use condition. That is, the invention can deploy language model reasoning service by selecting resource specification meeting set condition through the reasoning service resource scheduler according to cluster resource use condition, and ensure the stability of reasoning operation.
The present invention can use the data in table 1 for resource scheduling and access policy management.
Table 1 performance test data
Utilizing the inference service resource scheduler to select, according to the cluster resource usage, a resource specification that meets the set conditions and deploy the language model inference service may specifically include: screening, by the inference service resource scheduler and according to the configured processor and memory parameters, the nodes in the cluster that meet the processor and memory requirements; obtaining a resource specification list according to the inference service name, and selecting from the list the resource specification that provides the maximum throughput; judging whether the screened nodes meet the selected resource specification; if the screened nodes meet the selected resource specification, selecting an idle node among the screened nodes to deploy the language model inference service; and if the screened nodes do not meet the selected resource specification, selecting a corresponding node to deploy the language model inference service according to the ordering of throughput.
It should be noted that when a language model service is deployed on the inference service platform, the CPU and memory parameters must be set, but the number and model of GPU cards are not configured; the inference service resource scheduler automatically selects an appropriate number of GPU cards. The CPU and memory are configured to prevent the inference service from misusing memory or CPU, which would otherwise make the host node abnormal. The inference service platform performs scheduling according to cluster resources and inference service resource specifications. The scheduling process may comprise: first, screening the nodes in the cluster that meet the CPU and memory requirements, by using the inference service scheduler according to the configured CPU and memory parameters; then, obtaining the usable resource specifications according to the inference service name, and selecting the resource specification capable of providing the maximum throughput for resource scheduling, that is, automatically selecting the appropriate resource specification based on the maximum throughput of the language model service on nodes of different specifications. Based on this strategy, the number of expansions of the inference service can be reduced and more service accesses directed to the same inference service copy; in the limiting case, RadixAttention (a method for optimizing the reasoning performance of a language model) can reduce redundant calculation by multiplexing the key-value cache, thereby improving inference service performance.
When the free resources of the cluster cannot meet the selected resource specification, a resource specification providing less throughput is selected for scheduling. For example, for LLM-mode-2 in Table 1, when $T_{8 \times H100} > T_{8 \times A100} > T_{4 \times A100}$, the resource specification selection sequence is 8 H100, 8 A100, 4 A100; that is, the corresponding nodes are selected in descending order of throughput. When a node still cannot be selected using the resource specification with the minimum throughput, the task scheduling fails, and the task must wait for other tasks to release resources and be scheduled again.
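Under the assumption that the tested specifications are ranked purely by their measured maximum throughput, the fallback selection can be sketched as follows (function and spec names are hypothetical):

```python
def pick_spec(specs, schedulable):
    """specs: list of (spec_name, max_throughput) from the performance test;
    schedulable: set of spec names the cluster's free resources can satisfy.
    Try specifications from largest to smallest throughput; None means the
    task must wait for other tasks to release resources."""
    for name, _throughput in sorted(specs, key=lambda s: s[1], reverse=True):
        if name in schedulable:
            return name
    return None
```

Returning `None` corresponds to the scheduling failure described above: the task is queued until resources are released and scheduling is retried.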
Further, in the method for managing the inference service provided by the embodiment of the invention, after the language model inference service is deployed, the method can further comprise the steps of recording the address of the inference service copy and configuring the address of the inference service copy in a load balancer.
Accordingly, forwarding the language model dialog to the corresponding inference service replica using the load balancer may specifically include forwarding the language model dialog to the corresponding inference service replica using the load balancer based on the address of the configured inference service replica.
In implementations, when the deployment of the inference service is complete, the IP address of the inference service copy and its maximum throughput $T_{\max}$ can be recorded, and the copy's IP address configured in the load balancer built into the platform; the platform's built-in load-balancing service is then automatically configured for forwarding, so that external requests are forwarded to the node where the inference service is located and the inference service is provided externally.
Further, in the above-mentioned reasoning service management method provided by the embodiment of the present invention, step S102 may specifically include calculating, by using the reasoning service scaling component, a throughput sum of each language model session of each user in the reasoning service copy within a set duration, and calculating, by using the throughput sum and the set duration, a throughput average of each language model session of each user within the set duration.
In implementation, since each inference service provides multi-round dialogue services for different users, throughput is generally calculated per dialogue of each user to obtain the token processing speed of each dialogue, with token/s as the statistical unit; when counting throughput for a copy, the throughput of each dialogue needs to be aggregated. The invention can count the sum of the throughput of accesses to the inference service copy within a set time period (such as 1 minute), calculate the average throughput in that period, and trigger the expansion of the inference service when the average reaches the set threshold. The invention can trigger expansion when the real-time throughput reaches 90% of the maximum throughput of the inference service copy, and scale down the inference service copies when no request arrives within a certain period.
Taking the deployment of LLM-mode-2 in Table 1 on 8 H100 cards as an example, the average throughput across every session of every user is calculated for a set period of time (e.g., 1 minute), and when it is greater than 90% of the maximum throughput of the copy, the expansion operation is triggered. The 90% threshold is chosen because of the difference in length between the user request and the model response, and the randomness of the number of tokens generated by the model in the Prompt and Generate phases, which make the amount of GPU resources occupied by a single request uncertain; a 10% reserve of free capacity is therefore required. The specific formula is as follows:

$$\frac{\sum_{i}\sum_{j} T_{ij}}{t} > 0.9 \, T_{\max}$$

where $T_{ij}$ represents the throughput of the j-th session of user i, and $t$ represents the set duration, which may be set to 60 s.
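The 90% trigger above can be sketched as a simple check; this is a minimal illustration, and the sampling details are assumptions:

```python
SCALE_UP_FRACTION = 0.9   # 10% headroom for prompt/generate length variance

def should_scale_up(t_ij_samples, t, t_max):
    """Trigger expansion when the window-averaged throughput exceeds
    90% of the copy's maximum throughput (the text uses t = 60 s windows).

    t_ij_samples: per-window T_ij readings summed over users and sessions;
    t: set duration in seconds; t_max: maximum throughput of the copy."""
    avg = sum(t_ij_samples) / t
    return avg > SCALE_UP_FRACTION * t_max
```

The scaling component would evaluate this check once per window and, when it returns true, send the capacity-expansion notification signal to the resource scheduler.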
It should be added that the present invention may also be used for reasoning services running across GPUs and other types of acceleration cards, such as neural processing units (Neural Processing Unit, NPU); that is, some copies run on GPU nodes and some run on nodes with other types of acceleration cards, which will not be described again herein.
In the above embodiments, the present invention further provides an embodiment corresponding to the reasoning service management apparatus and the reasoning service management device. It should be noted that the present invention describes an embodiment of the device portion from two angles, one based on the angle of the functional module and the other based on the angle of the hardware.
Fig. 4 is a schematic structural diagram of an inference service management apparatus according to an embodiment of the present invention. The embodiment is based on the angle of the functional module, and the device comprises:
the copy forwarding module 10 is configured to receive the language model dialogue and forward the language model dialogue to a corresponding reasoning service copy;
The data statistics module 11 is used for counting throughput data of the reasoning service copies in a set period when the language model reasoning service runs;
The copy capacity expansion module 12 is configured to perform reasoning service capacity expansion to obtain a capacity expansion copy when the counted throughput data exceeds a set throughput threshold;
An address configuration module 13, configured to configure the address of the expanded copy.
In the reasoning service management device provided by the embodiment of the invention, through interaction of the four modules, the language model dialogue can be forwarded to the corresponding reasoning service copy, throughput data of the reasoning service copy in a set period is counted, when the data exceeds a set throughput threshold, the reasoning service is expanded, an expanded copy is obtained, and an address of the expanded copy is configured. Therefore, the inference service expansion mechanism and the load balancing strategy configuration in the language model service scene can be optimized, throughput in the language model service is used as a basis for carrying out service expansion and load balancing selection, and when the throughput exceeds a threshold value, the expansion of the language model inference service is triggered, so that the allocation and use of resources are optimized, the resource utilization rate is improved, and the user experience is improved.
Since the embodiments of the apparatus portion and the embodiments of the method portion correspond to each other, the embodiments of the apparatus portion are referred to the description of the embodiments of the method portion, and are not repeated herein. And has the same advantageous effects as the above-mentioned reasoning service management method.
Fig. 5 is a schematic structural diagram of an inference service management apparatus according to an embodiment of the present invention. The present embodiment is based on a hardware point of view, and as shown in fig. 5, the inference service management apparatus includes:
a memory 20 for storing a computer program;
a processor 21 for implementing the steps of the inference service management method as mentioned in the above embodiments when executing a computer program.
Processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc. The processor 21 may be implemented in at least one hardware form of a digital signal processor (Digital Signal Processor, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA), or a programmable logic array (Programmable Logic Array, PLA). The processor 21 may also comprise a main processor and a coprocessor: the main processor is a processor for processing data in the wake-up state, also called a CPU; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 21 may be integrated with a graphics processor (Graphics Processing Unit, GPU) for rendering and drawing the content required to be displayed by the display screen. In some embodiments, the processor 21 may also include an artificial intelligence (Artificial Intelligence, AI) processor for processing computing operations related to machine learning.
The memory 20 may include one or more non-volatile storage media, which may be non-transitory. Memory 20 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 20 is at least used for storing a computer program 201, which, when loaded and executed by the processor 21, is capable of implementing the relevant steps of the reasoning service management method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 20 may further include an operating system 202, data 203, and the like, where the storage manner may be transient storage or permanent storage. Operating system 202 may include Windows, unix, linux, among other things. The data 203 may include, but is not limited to, data related to the above-mentioned inference service management method, and the like.
In some embodiments, the inference service management device may further include a display 22, an input-output interface 23, a communication interface 24, a power supply 25, and a communication bus 26. Those skilled in the art will appreciate that the structure shown in fig. 5 does not constitute a limitation of the inference service management apparatus, which may include more or fewer components than illustrated. The reasoning service management device provided by the embodiment of the invention comprises the memory and the processor, and when executing the program stored in the memory, the processor can implement the above-described reasoning service management method, with the same effects as that method.
Finally, the invention also provides a corresponding embodiment of the nonvolatile storage medium. The nonvolatile storage medium has stored thereon a computer program which, when executed by a processor, performs the steps described in the method embodiments described above.
It will be appreciated that the methods of the above embodiments, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored on a computer readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium for performing all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a U disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes. The nonvolatile storage medium provided by the invention can realize the reasoning service management method, and the effects are the same as those of the reasoning service management method.
Finally, the invention also provides a corresponding embodiment of the computer program product. The computer program product comprises computer programs/instructions which when executed by the processor implement the steps as described in the embodiments of the reasoning service management method described above. The computer program product provided by the invention can realize the above-mentioned reasoning service management method, and the effects are the same as the above.
It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The inference service management method, the device, the medium and the computer program product provided by the invention are described in detail above. In the description, each embodiment is described in a progressive manner, and each embodiment is mainly described by the differences from other embodiments, so that the same similar parts among the embodiments are mutually referred. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section. It should be noted that it will be apparent to those skilled in the art that the present invention may be modified and practiced without departing from the spirit of the present invention.

Claims (15)

1. An inference service management method, characterized in that it is applied to an inference service platform comprising a load balancer, an inference service scaling component and an inference service resource scheduler, the method comprising:
receiving a language model dialogue, and forwarding the language model dialogue to a corresponding inference service copy by using the load balancer;
when the language model inference service is running, counting throughput data of the inference service copy within a set period by using the inference service scaling component;
when the counted throughput data exceeds a set throughput threshold, sending a capacity-expansion notification signal to the inference service resource scheduler through the inference service scaling component, so that the inference service resource scheduler expands the inference service to obtain an expanded copy; and
configuring the address of the expanded copy by using the load balancer.

2. The inference service management method according to claim 1, characterized in that receiving a language model dialogue and forwarding the language model dialogue to the corresponding inference service copy by using the load balancer comprises:
receiving a first language model dialogue carrying a target dialogue identifier;
determining the corresponding inference service copy according to the target dialogue identifier; and
forwarding the first language model dialogue to the determined inference service copy by using the load balancer.

3. The inference service management method according to claim 2, characterized in that, before receiving the first language model dialogue carrying the target dialogue identifier, the method further comprises:
establishing an association relationship between dialogue identifiers and inference service copies;
and determining the corresponding inference service copy according to the target dialogue identifier comprises:
determining the inference service copy associated with the target dialogue identifier according to the established association relationship between dialogue identifiers and inference service copies, in combination with the target dialogue identifier.

4. The inference service management method according to claim 3, characterized in that establishing an association relationship between dialogue identifiers and inference service copies comprises:
when a language model dialogue is created for the first time, randomly generating a corresponding dialogue identifier;
querying real-time throughput data of the inference service copies by using the load balancer, and selecting the inference service copy with the minimum throughput; and
recording the relationship between the randomly generated dialogue identifier and the address of the selected inference service copy, so as to establish the association relationship between the dialogue identifier and the inference service copy.

5. The inference service management method according to claim 3, further comprising:
if the throughput of the determined inference service copy has reached its maximum value or the inference service copy does not exist, forwarding the first language model dialogue to another inference service copy by using the load balancer, and updating the association relationship between the dialogue identifier and the inference service copy.

6. The inference service management method according to claim 1, characterized in that receiving a language model dialogue and forwarding the language model dialogue to the corresponding inference service copy by using the load balancer comprises:
receiving a second language model dialogue that does not carry a dialogue identifier;
selecting the inference service copy with the minimum ratio between real-time throughput and maximum throughput by using the load balancer; and
forwarding the second language model dialogue to the selected inference service copy.

7. The inference service management method according to claim 1, further comprising:
when the counted throughput data is zero, sending a capacity-reduction notification signal to the inference service resource scheduler through the inference service scaling component, so that the inference service resource scheduler scales down the inference service to obtain a scaled-down copy; and
removing the address of the scaled-down copy by using the load balancer.

8. The inference service management method according to claim 1, characterized in that, before receiving the language model dialogue, the method further comprises:
when nodes of multiple specifications exist in the cluster, performing a throughput performance test on the language model service, and counting the maximum throughput performance data of the language model service on nodes of different specifications; and
defining different set throughput thresholds for different inference service copies of the same language model service according to the counted maximum throughput performance data.

9. The inference service management method according to claim 1, characterized in that, before receiving the language model dialogue, the method further comprises:
selecting, by using the inference service resource scheduler and according to the cluster resource usage, a resource specification that meets set conditions to deploy the language model inference service.

10. The inference service management method according to claim 9, characterized in that selecting, by using the inference service resource scheduler and according to the cluster resource usage, a resource specification that meets set conditions to deploy the language model inference service comprises:
screening, by using the inference service resource scheduler and according to the configured processor and memory parameters, the nodes in the cluster that meet the processor and memory requirements;
obtaining a resource specification list according to the inference service name, and selecting from the resource specification list the resource specification that provides the maximum throughput;
judging whether the screened nodes meet the selected resource specification;
if the screened nodes meet the selected resource specification, selecting an idle node among the screened nodes to deploy the language model inference service; and
if the screened nodes do not meet the selected resource specification, selecting a corresponding node to deploy the language model inference service according to the ordering of throughput.

11.
The inference service management method according to claim 10, characterized in that after deploying the language model inference service, it also includes: 记录推理服务副本的地址,并在所述负载均衡器中配置推理服务副本的地址;Recording the address of the inference service replica, and configuring the address of the inference service replica in the load balancer; 利用所述负载均衡器将所述语言模型对话转发至相应的推理服务副本,包括:Forwarding the language model dialogue to a corresponding inference service replica using the load balancer includes: 利用所述负载均衡器根据配置的推理服务副本的地址将所述语言模型对话转发至相应的推理服务副本。The load balancer is used to forward the language model dialogue to the corresponding inference service replica according to the address of the configured inference service replica. 12.根据权利要求1所述的推理服务管理方法,其特征在于,利用所述推理服务伸缩组件统计推理服务副本在设定周期内的吞吐量数据,包括:12. The inference service management method according to claim 1, characterized in that the inference service scaling component is used to count the throughput data of the inference service replica within a set period, comprising: 利用所述推理服务伸缩组件计算推理服务副本中在设定时长内每个用户每个语言模型对话的吞吐量总和;Calculate the total throughput of each language model conversation of each user in the inference service replica within a set time period using the inference service scaling component; 利用所述吞吐量总和与所述设定时长,计算在设定时长内每个用户每个语言模型对话的吞吐量平均值。The throughput sum and the set duration are used to calculate the average throughput of each language model dialogue for each user within the set duration. 13.一种推理服务管理设备,其特征在于,所述设备包括:13. A reasoning service management device, characterized in that the device comprises: 存储器,用于存储计算机程序;Memory for storing computer programs; 处理器,用于执行所述计算机程序时实现如权利要求1至12任一项所述的推理服务管理方法的步骤。A processor, configured to implement the steps of the inference service management method according to any one of claims 1 to 12 when executing the computer program. 14.一种非易失性存储介质,其特征在于,所述非易失性存储介质上存储有计算机程序,所述计算机程序被处理器执行时实现如权利要求1至12任一项所述的推理服务管理方法的步骤。14. 
A non-volatile storage medium, characterized in that a computer program is stored on the non-volatile storage medium, and when the computer program is executed by a processor, the steps of the inference service management method according to any one of claims 1 to 12 are implemented. 15.一种计算机程序产品,包括计算机程序/指令,其特征在于,所述计算机程序/指令被处理器执行时实现如权利要求1至12任一项所述的推理服务管理方法的步骤。15. A computer program product, comprising a computer program/instruction, wherein when the computer program/instruction is executed by a processor, the steps of the inference service management method according to any one of claims 1 to 12 are implemented.
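The sticky routing of claims 3 to 5 can be sketched as follows. This is an illustrative model only, not the patented implementation: the `StickyRouter` class, the replica addresses, and the `"throughput"`/`"max"` field names are all assumptions introduced for the example.

```python
import random
import string


class StickyRouter:
    """Sketch of the dialogue-identifier routing in claims 3-5."""

    def __init__(self, replicas):
        # replicas: replica address -> {"throughput": current load, "max": limit}
        self.replicas = replicas
        self.sessions = {}  # dialogue identifier -> replica address

    def _least_loaded(self):
        # Claim 4: pick the replica whose current throughput is lowest.
        return min(self.replicas, key=lambda r: self.replicas[r]["throughput"])

    def new_dialogue(self):
        # Claim 4: the first message of a conversation gets a randomly
        # generated identifier bound to the least-loaded replica, and the
        # identifier-to-address mapping is recorded.
        dialogue_id = "".join(random.choices(string.ascii_lowercase, k=8))
        self.sessions[dialogue_id] = self._least_loaded()
        return dialogue_id

    def route(self, dialogue_id):
        # Claim 3: follow-up messages reuse the recorded replica.
        # Claim 5: if that replica is gone or saturated, rebind and
        # update the association.
        addr = self.sessions.get(dialogue_id)
        if (addr is None or addr not in self.replicas
                or self.replicas[addr]["throughput"] >= self.replicas[addr]["max"]):
            addr = self._least_loaded()
            self.sessions[dialogue_id] = addr
        return addr
```

Binding a conversation to one replica keeps its KV-cache and context on a single backend, which is why the claims avoid plain round-robin for messages that carry an identifier.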
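For a dialogue that carries no identifier, claim 6 selects by the ratio of real-time to maximum throughput rather than by absolute load. A minimal sketch, with the `"current"`/`"max"` field names assumed for illustration:

```python
def pick_replica(replicas):
    """Claim 6 sketch: choose the replica whose real-time/maximum
    throughput ratio is smallest."""
    return min(replicas, key=lambda addr: replicas[addr]["current"] / replicas[addr]["max"])
```

The ratio matters when replicas run on heterogeneous nodes (claim 8): a replica with higher absolute load can still be the better choice if its maximum throughput is proportionally larger.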
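The scale-in path of claim 7 can be sketched as a reconciliation pass. The data shapes are assumptions, and the notification to the resource scheduler is stubbed out; only the load-balancer cleanup is modeled.

```python
def reconcile(balancer_table, window_throughput):
    """Claim 7 sketch: replicas whose throughput over the statistics window
    is zero are scaled in, and their addresses are removed from the load
    balancer's address table."""
    scaled_in = [addr for addr, tp in window_throughput.items() if tp == 0]
    for addr in scaled_in:
        # In the claimed flow the scaling component would notify the
        # inference service resource scheduler here; this sketch only
        # mirrors the subsequent load-balancer cleanup.
        balancer_table.pop(addr, None)
    return scaled_in
```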
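The node-selection steps of claim 10 (filter by CPU/memory, prefer an idle node of the highest-throughput specification, otherwise fall back by throughput order) can be sketched as below. Every field name (`"cpu"`, `"mem"`, `"spec"`, `"idle"`) is an illustrative assumption.

```python
def choose_node(nodes, cpu_req, mem_req, spec_throughput):
    """Claim 10 sketch: pick a deployment node for the language model
    inference service."""
    # Step 1: filter nodes meeting the configured processor/memory needs.
    fits = [n for n in nodes if n["cpu"] >= cpu_req and n["mem"] >= mem_req]
    if not fits:
        return None
    # Step 2: from the resource specification list, take the specification
    # with the maximum benchmarked throughput.
    best_spec = max(spec_throughput, key=spec_throughput.get)
    # Step 3: prefer an idle node of that specification.
    idle = [n for n in fits if n["spec"] == best_spec and n["idle"]]
    if idle:
        return idle[0]
    # Step 4: otherwise fall back through candidates in descending order
    # of their specification's throughput.
    return max(fits, key=lambda n: spec_throughput.get(n["spec"], 0))
```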
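Claim 12's throughput statistic reduces to a sum divided by the window length. In this sketch the `(user, dialogue)` keying and the per-sample token counts are assumptions about what the scaling component collects:

```python
def average_throughput(samples, window_seconds):
    """Claim 12 sketch: sum the per-user, per-dialogue throughput samples
    collected over the window, then divide by the window length."""
    totals = {key: sum(counts) for key, counts in samples.items()}
    return {key: total / window_seconds for key, total in totals.items()}
```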
CN202510025516.XA 2025-01-08 2025-01-08 Reasoning service management method, device, medium and computer program product Pending CN119415273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510025516.XA CN119415273A (en) 2025-01-08 2025-01-08 Reasoning service management method, device, medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510025516.XA CN119415273A (en) 2025-01-08 2025-01-08 Reasoning service management method, device, medium and computer program product

Publications (1)

Publication Number Publication Date
CN119415273A true CN119415273A (en) 2025-02-11

Family

ID=94462370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510025516.XA Pending CN119415273A (en) 2025-01-08 2025-01-08 Reasoning service management method, device, medium and computer program product

Country Status (1)

Country Link
CN (1) CN119415273A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114510322A (en) * 2022-02-16 2022-05-17 平安国际智慧城市科技股份有限公司 Pressure measurement control method and device of service cluster, computer equipment and medium
CN114745278A (en) * 2022-04-11 2022-07-12 中和农信项目管理有限公司 Method and device for expanding and contracting capacity of business system, electronic equipment and storage medium
KR20240152764A (en) * 2023-04-13 2024-10-22 주식회사 케이티 Method for scheduling gpu resource in ai service provision platform and apparatus thereof
CN118819841A (en) * 2024-07-02 2024-10-22 摩尔线程智能科技(北京)有限责任公司 Load balancing method, device and electronic equipment


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119718878A (en) * 2025-02-28 2025-03-28 苏州元脑智能科技有限公司 Method, device and equipment for acquiring inference performance data of large model inference cluster
CN120358236A (en) * 2025-06-24 2025-07-22 苏州元脑智能科技有限公司 Load balancing method and device for reasoning service, electronic equipment and storage medium
CN120358236B (en) * 2025-06-24 2025-08-29 苏州元脑智能科技有限公司 Load balancing method and device for reasoning service, electronic equipment and storage medium
CN120353516A (en) * 2025-06-25 2025-07-22 苏州元脑智能科技有限公司 Service starting and controlling method, electronic device, storage medium and program product
CN120743558A (en) * 2025-08-29 2025-10-03 苏州元脑智能科技有限公司 Inference service resource allocation method, electronic device and readable storage medium
CN121148746A (en) * 2025-11-17 2025-12-16 复旦大学附属妇产科医院 A Dynamic Inference Load Balancing Service Method for Doctor-Patient Dialogue Model

Similar Documents

Publication Publication Date Title
CN119415273A (en) Reasoning service management method, device, medium and computer program product
US10257288B2 (en) System and method for throttling service requests having non-uniform workloads
CN110134495B (en) A container cross-host online migration method, storage medium and terminal device
US8275787B2 (en) System for managing data collection processes
CN104243405B (en) A kind of request processing method, apparatus and system
CN102426542B (en) Data center resource management system and job scheduling method
CN112015536A (en) Kubernetes cluster container group scheduling method, device and medium
Liu et al. An economical and SLO-guaranteed cloud storage service across multiple cloud service providers
KR102192442B1 (en) Balanced leader distribution method and system in kubernetes cluster
CN114090220B (en) Hierarchical CPU and memory resource scheduling method
CN114818454A (en) Model training method, data processing method, electronic device, and program product
CN108376103A (en) A kind of the equilibrium of stock control method and server of cloud platform
WO2021259246A1 (en) Resource scheduling method and apparatus, electronic device, and computer-readable storage medium
CN109358964B (en) Server cluster resource scheduling method
CN114448988A (en) A node load balancing method, apparatus, device, and storage medium
CN102984079B (en) Load balancing control method and system
CN110012058A (en) A kind of computing resource scheduling and improved method towards block chain
CN117271081B (en) Scheduling method, scheduling device and storage medium
CN119172250A (en) Cluster node scheduling method and device
CN110191362B (en) Data transmission method and device, storage medium and electronic equipment
CN115208891B (en) Hybrid cloud elastic scaling method, device, equipment, and storage medium
CN111740920B (en) A method and system for grayscale publishing and current limiting based on user token
CN118250287A (en) Transaction processing method and device
CN115914236A (en) Storage space allocation and adjustment method, device, electronic device and storage medium
CN117170861A (en) A cloud server microservice scheduling method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20250211

RJ01 Rejection of invention patent application after publication