CN114443302B

CN114443302B - Container cluster expansion method, system, terminal and storage medium

Info

Publication number: CN114443302B
Application number: CN202210098223.0A
Authority: CN
Inventors: 芮法玲
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2022-01-27
Filing date: 2022-01-27
Publication date: 2024-10-22
Anticipated expiration: 2042-01-27
Also published as: CN114443302A

Abstract

The present invention relates to the technical field of container clusters and specifically provides a container cluster expansion method, system, terminal and storage medium. The method includes: creating placeholder container nodes with the same specifications for each of multiple types of working container nodes, where the placeholder container nodes are in a dormant state and the types are two-dimensional classification results comprising a specification type and an application type; obtaining the type of an abnormal working container node that cannot be scheduled, locating working container nodes of the same type according to that type, and obtaining the specification parameters of those working container nodes; and searching all placeholder container nodes for a target placeholder container node whose specification is not lower than the specification parameters, and releasing the resources occupied by the target placeholder container node to the abnormal working container node. The present invention can relieve application pressure and ensure normal operation even when business pressure increases.

Description

Container cluster expansion method, system, terminal and storage medium

Technical Field

The present invention relates to the technical field of container clusters, and in particular to a container cluster expansion method, system, terminal and storage medium.

Background Art

The common practice for elastic scaling of a cluster is to monitor the cluster's resource usage and, when the load is high, trigger elastic scaling to expand the cluster and relieve application pressure. This approach requires separately deploying a monitoring component to obtain the cluster's resource usage. If the monitoring component is deployed inside the cluster, it adds load to the cluster itself; if it is deployed outside the cluster, it runs into restrictions such as permission control. Moreover, low resource utilization in the cluster does not mean that applications request few resources, so triggering elastic expansion from cluster resource utilization does not always achieve expansion that follows actual application demand. Triggering expansion when a pod cannot be scheduled is a simpler, more intuitive and effective trigger. However, if the cluster is expanded only once a pod cannot be scheduled, the pod must wait until the expansion completes before it can be scheduled, which delays its startup. For important services, this delay can have a serious impact.
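The gap between utilization and requested resources can be illustrated with a small numeric sketch. The node size and pod figures below are hypothetical and chosen only to show that a lightly used node can still reject a new pod, because scheduling is decided by requests rather than by measured usage.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cpu_request: float  # cores reserved for the pod at scheduling time
    cpu_usage: float    # cores the pod actually consumes


NODE_CPU = 8.0  # hypothetical allocatable cores on a single node

running = [Pod("a", 3.0, 0.5), Pod("b", 2.5, 0.25), Pod("c", 2.0, 0.25)]

# What a utilization-based monitor would see.
utilization = sum(p.cpu_usage for p in running) / NODE_CPU
# What the scheduler has left to hand out.
unreserved = NODE_CPU - sum(p.cpu_request for p in running)


def can_schedule(cpu_request: float) -> bool:
    """A pod fits only if its request fits in the unreserved capacity."""
    return cpu_request <= unreserved


print(f"utilization: {utilization:.1%}")
print(f"unreserved cores: {unreserved}")
print(f"1-core pod schedulable: {can_schedule(1.0)}")
```

In this sketch the monitor sees a nearly idle node, yet a pod requesting a single core cannot be placed, which is the situation that utilization-based triggers miss.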

Summary of the Invention

In view of the problems in the prior art that monitoring components occupy resources and that expanding only when a pod cannot be scheduled delays pod startup and harms business continuity, the present invention provides a container cluster expansion method, system, terminal and storage medium to solve the above technical problems.

In a first aspect, the present invention provides a container cluster expansion method, comprising:

creating placeholder container nodes with the same specifications for each of multiple types of working container nodes, where the placeholder container nodes are in a dormant state and the types are two-dimensional classification results comprising a specification type and an application type;

obtaining the type of an abnormal working container node that cannot be scheduled, locating working container nodes of the same type according to the type of the abnormal working container node, and obtaining the specification parameters of the working container nodes of the same type;

searching all placeholder container nodes for a target placeholder container node whose specification is not lower than the specification parameters, and releasing the resources occupied by the target placeholder container node to the abnormal working container node.

Further, before creating placeholder container nodes with the same specifications for each of the multiple types of working container nodes, the method further includes:

classifying the working container nodes by specification size and marking them with a specification type, where the specification refers to the hardware resources occupied by a working container node;

classifying the working container nodes by application role and marking them with an application type.

Further, obtaining the type of the abnormal working container node that cannot be scheduled, locating working container nodes of the same type according to that type, and obtaining their specification parameters includes:

monitoring the status of the working container nodes and treating any working container node whose status is unschedulable as an abnormal working container node;

obtaining the marking content of the abnormal working container node and finding working container nodes of the same type whose marking content matches it, where the marking content includes the specification type and the application type;

calculating the hardware resource parameters occupied by the working container nodes of the same type and outputting them as the specification parameters.

Further, searching all placeholder container nodes for a target placeholder container node whose specification is not lower than the specification parameters, and releasing the resources occupied by the target placeholder container node to the abnormal working container node includes:

sorting all abnormal working container nodes in the scheduling queue by requested resources in ascending order and selecting target abnormal working container nodes from them in that order;

filtering all placeholder container nodes to obtain a plurality of candidate placeholder container nodes whose specifications are not lower than the corresponding specification parameters of the target abnormal working container node;

sorting the candidate placeholder container nodes by specification in ascending order and selecting the first one as the placeholder container node to be matched;

calculating the actual hardware resources of the placeholder container node to be matched and determining whether they are not less than the resources requested by the target abnormal working container node:

if so, migrating the target abnormal working container node onto the hardware resources of the placeholder container node to be matched, the placeholder container node being evicted by the higher-priority target abnormal working container node;

if not, selecting the next candidate placeholder container node in order as the placeholder container node to be matched.

In a second aspect, the present invention provides a container cluster expansion system, comprising:

a placeholder creation unit, configured to create placeholder container nodes with the same specifications for each of multiple types of working container nodes, where the placeholder container nodes are in a dormant state and the types are two-dimensional classification results comprising a specification type and an application type;

an abnormality acquisition unit, configured to obtain the type of an abnormal working container node that cannot be scheduled, locate working container nodes of the same type according to that type, and obtain the specification parameters of the working container nodes of the same type;

a node expansion unit, configured to search all placeholder container nodes for a target placeholder container node whose specification is not lower than the specification parameters, and release the resources occupied by the target placeholder container node to the abnormal working container node.

Further, the system also includes:

a first classification module, configured to classify the working container nodes by specification size and mark them with a specification type, where the specification refers to the hardware resources occupied by a working container node;

a second classification module, configured to classify the working container nodes by application role and mark them with an application type.

Further, the abnormality acquisition unit includes:

an abnormality monitoring module, configured to monitor the status of the working container nodes and treat any working container node whose status is unschedulable as an abnormal working container node;

a mark search module, configured to obtain the marking content of the abnormal working container node and find working container nodes of the same type whose marking content matches it, where the marking content includes the specification type and the application type;

a specification acquisition module, configured to calculate the hardware resource parameters occupied by the working container nodes of the same type and output them as the specification parameters.

Further, the node expansion unit includes:

a scheduling sorting module, configured to sort all abnormal working container nodes in the scheduling queue by requested resources in ascending order and select target abnormal working container nodes from them in that order;

a candidate screening module, configured to filter all placeholder container nodes to obtain a plurality of candidate placeholder container nodes whose specifications are not lower than the corresponding specification parameters of the target abnormal working container node;

a placeholder sorting module, configured to sort the candidate placeholder container nodes by specification in ascending order and select the first one as the placeholder container node to be matched;

a resource judgment module, configured to calculate the actual hardware resources of the placeholder container node to be matched and determine whether they are not less than the resources requested by the target abnormal working container node;

a node migration module, configured to, if the resource judgment module's result is yes, migrate the target abnormal working container node onto the hardware resources of the placeholder container node to be matched, the placeholder container node being evicted by the higher-priority target abnormal working container node;

a target reselection module, configured to, if the resource judgment module's result is no, select the next candidate placeholder container node in order as the placeholder container node to be matched.

In a third aspect, a terminal is provided, including:

a processor and a memory, wherein

the memory is used to store a computer program, and

the processor is used to call and run the computer program from the memory, so that the terminal performs the method described above.

In a fourth aspect, a computer storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the methods described in the above aspects.

The beneficial effect of the present invention is that the container cluster expansion method, system, terminal and storage medium it provides can respond quickly when infrastructure resources become insufficient because application resource demand has grown, for example when a surge in business volume increases the number of application replicas in a k8s cluster or when new applications are added. This relieves application pressure and ensures normal operation even as business pressure grows.

In addition, the present invention has a reliable design principle, a simple structure and very broad application prospects.

Brief Description of the Drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. Obviously, a person of ordinary skill in the art can derive other drawings from these drawings without creative effort.

FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.

FIG. 2 is a schematic block diagram of a system according to an embodiment of the present invention.

FIG. 3 is a schematic structural diagram of a terminal provided by an embodiment of the present invention.

Detailed Description

To help those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

The key terms used in the present invention are explained below.

Kubernetes (k8s) is a container cluster management system open-sourced by Google. It provides a complete set of functions for containerized applications, including deployment and operation, resource scheduling, service discovery and dynamic scaling, and makes large-scale container clusters easier to manage.

A Pod is the smallest unit that Kubernetes schedules. A Pod can contain one or more containers, so it can be regarded as the logical host of those containers. The container nodes referred to in the present invention are Pods.

FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention. The method of FIG. 1 may be executed by a container cluster expansion system.

As shown in FIG. 1, the method includes:

Step 110: creating placeholder container nodes with the same specifications for each of multiple types of working container nodes, where the placeholder container nodes are in a dormant state and the types are two-dimensional classification results comprising a specification type and an application type;

Step 120: obtaining the type of an abnormal working container node that cannot be scheduled, locating working container nodes of the same type according to that type, and obtaining the specification parameters of the working container nodes of the same type;

Step 130: searching all placeholder container nodes for a target placeholder container node whose specification is not lower than the specification parameters, and releasing the resources occupied by the target placeholder container node to the abnormal working container node.

To facilitate understanding, the container cluster expansion method provided by the present invention is further described below, based on its principle and on the process of expanding a container cluster in an embodiment.

Specifically, the container cluster expansion method includes:

S1. Create placeholder container nodes with the same specifications for each of multiple types of working container nodes. The placeholder container nodes are in a dormant state, and the types are two-dimensional classification results comprising a specification type and an application type.

The working container nodes in the cluster are divided into node groups along two dimensions: resource specification and application role. They are first divided by specification and then by application role. For the second dimension they can simply be divided by application importance into normal, important and critical, or by application into app1, app2 and app3. When dividing along the second dimension, nodes belonging to different node groups are given different labels.

Node group example: [(s1.large:critical), (s1.large:normal), (s1.Medium:critical), (s2.medium:normal)].

Classify the applications in the cluster and add the corresponding labels to them, such as critical and normal.
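The two-dimensional grouping can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the `SPEC_TYPES` table and the label keys `spec-type` and `app-type` are hypothetical names, and the source does not say how a specification type is derived from hardware figures.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu: int          # allocatable cores
    memory_gib: int   # allocatable memory in GiB
    app_type: str     # e.g. "normal", "important", "critical"
    labels: dict = field(default_factory=dict)


# Hypothetical table mapping (cpu, memory) to a specification type.
SPEC_TYPES = {(4, 16): "s1.medium", (8, 32): "s1.large"}


def label_node(node: Node) -> Node:
    """Mark the node along both dimensions: specification, then application."""
    node.labels["spec-type"] = SPEC_TYPES[(node.cpu, node.memory_gib)]
    node.labels["app-type"] = node.app_type
    return node


def group_nodes(nodes):
    """Group node names by their (specification type, application type) pair."""
    groups = defaultdict(list)
    for node in map(label_node, nodes):
        key = (node.labels["spec-type"], node.labels["app-type"])
        groups[key].append(node.name)
    return dict(groups)


nodes = [
    Node("node-1", 8, 32, "critical"),
    Node("node-2", 8, 32, "normal"),
    Node("node-3", 4, 16, "critical"),
    Node("node-4", 8, 32, "critical"),
]
groups = group_nodes(nodes)
for key, members in groups.items():
    print(key, members)
```

Each key of `groups` corresponds to one node group of the form (specification:application) used in the example above.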

To respond promptly to expansion demands, placeholder container nodes are configured. For example, App1 needs resources reserved for 2 redundant replicas, App2 needs resources for 1 redundant replica, and App3 needs no redundancy. A placeholder container node is a pod that only reserves resources without using them. For App1, the placeholder consists of 2 permanently dormant pods that request the same resources as App1's pods. Placeholder pods that reserve resources for different applications are given different labels. Because a placeholder is essentially a pod, it can trigger expansion when it cannot be scheduled, which meets the need to over-provision nodes for the cluster; and because it is permanently dormant, it does not actually consume the node's resources. Priority and preemption are used so that a placeholder pod is evicted once a real pod is created: when assigning pod priorities with PodPriorityClass, placeholder pods are given a lower priority than working pods.
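One way to picture the placeholder configuration is as a pair of Kubernetes-style manifests built in Python. This is a minimal sketch, not the patented implementation: Kubernetes exposes priority through `PriorityClass` objects and the pod field `priorityClassName`, while the class name `placeholder`, the priority value, the label keys, the resource figures and the use of a pause image as the "permanently dormant" container are all assumptions made for illustration.

```python
def priority_class(name: str, value: int) -> dict:
    """Build a PriorityClass manifest; a low value makes its pods easy to preempt."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": "Low priority for placeholder pods (illustrative).",
    }


def placeholder_pods(app: str, replicas: int, cpu: str, memory: str,
                     priority_class_name: str = "placeholder") -> list:
    """Build one dormant pod manifest per redundant replica of the application."""
    return [
        {
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {
                "name": f"placeholder-{app}-{i}",
                # Hypothetical label keys tying the placeholder to its application.
                "labels": {"app": app, "role": "placeholder"},
            },
            "spec": {
                "priorityClassName": priority_class_name,
                "containers": [{
                    "name": "pause",
                    # Assumption: a pause container reserves resources but does no work.
                    "image": "registry.k8s.io/pause:3.9",
                    "resources": {"requests": {"cpu": cpu, "memory": memory}},
                }],
            },
        }
        for i in range(replicas)
    ]


# Redundancy from the example: App1 keeps 2 spare replicas, App2 keeps 1, App3 none.
redundancy = {"app1": 2, "app2": 1, "app3": 0}
low_priority = priority_class("placeholder", -10)
manifests = [
    pod
    for app, count in redundancy.items()
    for pod in placeholder_pods(app, count, cpu="500m", memory="512Mi")
]
print(len(manifests), [m["metadata"]["name"] for m in manifests])
```

Working pods would reference a priority class with a higher value, so that the scheduler can preempt a placeholder when a working pod needs its reserved capacity.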

S2. Obtain the type of the abnormal working container node that cannot be scheduled, locate working container nodes of the same type according to that type, and obtain the specification parameters of the working container nodes of the same type.

Monitor the status of the working container nodes and treat any working container node whose status is unschedulable as an abnormal working container node. Obtain the marking content of the abnormal working container node and find working container nodes of the same type whose marking content matches it, where the marking content includes the specification type and the application type. Calculate the hardware resource parameters occupied by the working container nodes of the same type and output them as the specification parameters.

For example, if the abnormal working container node carries the labels (s1.large:critical) and App3:normal, the working container nodes with the same labels are looked up; the nodes found belong to the same type as the abnormal working container node. The hardware resource parameters occupied by those working container nodes are the specification parameters.
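The label lookup in S2 can be sketched as below. The `WorkPod` fields are hypothetical, and because the source does not state how the occupied resources of several peers are combined, the sketch assumes the per-resource maximum across peers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WorkPod:
    name: str
    spec_type: str        # e.g. "s1.large"
    app_type: str         # e.g. "critical"
    cpu_millicores: int   # CPU currently occupied
    memory_mib: int       # memory currently occupied
    schedulable: bool = True


def spec_parameters(abnormal: WorkPod, pods) -> Optional[Tuple[int, int]]:
    """Return (cpu, memory) derived from scheduled peers with identical labels."""
    peers = [
        p for p in pods
        if p.schedulable
        and (p.spec_type, p.app_type) == (abnormal.spec_type, abnormal.app_type)
    ]
    if not peers:
        return None  # no peer of the same type to take the specification from
    # Assumption: use the largest footprint observed among the peers.
    return (max(p.cpu_millicores for p in peers),
            max(p.memory_mib for p in peers))


pods = [
    WorkPod("web-0", "s1.large", "critical", 900, 1800),
    WorkPod("web-1", "s1.large", "critical", 1000, 2048),
    WorkPod("batch-0", "s1.large", "normal", 3000, 4096),
    WorkPod("web-2", "s1.large", "critical", 0, 0, schedulable=False),
]
abnormal = pods[-1]
params = spec_parameters(abnormal, pods)
print(params)
```

Here `web-2` is the unschedulable pod; only `web-0` and `web-1` share both of its labels, so `batch-0` does not influence the result.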

S3. Search all placeholder container nodes for a target placeholder container node whose specification is not lower than the specification parameters, and release the resources occupied by the target placeholder container node to the abnormal working container node.

All abnormal working container nodes in the scheduling queue are sorted by requested resources in ascending order, and target abnormal working container nodes are selected from them in that order. All placeholder container nodes are filtered to obtain a plurality of candidate placeholder container nodes whose specifications are not lower than the corresponding specification parameters of the target abnormal working container node. The candidates are sorted by specification in ascending order, and the first one is selected as the placeholder container node to be matched. The actual hardware resources of the placeholder container node to be matched are calculated, and it is determined whether they are not less than the resources requested by the target abnormal working container node. If so, the target abnormal working container node is migrated onto the hardware resources of the placeholder container node to be matched, and that placeholder container node is evicted by the higher-priority target abnormal working container node. If not, the next candidate placeholder container node is selected in order as the placeholder container node to be matched.
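The matching procedure of S3 can be sketched as a smallest-sufficient-placeholder search. The class and field names are hypothetical, resources are modelled as (CPU millicores, memory MiB) tuples, and eviction is represented simply by removing the chosen placeholder from the free list.

```python
from dataclasses import dataclass


@dataclass
class PendingPod:
    name: str
    request: tuple  # (cpu, memory) requested by the unschedulable pod
    spec: tuple     # specification parameters obtained in S2


@dataclass
class Placeholder:
    name: str
    spec: tuple     # declared (cpu, memory) of the placeholder
    actual: tuple   # resources the placeholder actually holds


def covers(have: tuple, need: tuple) -> bool:
    """True when every resource in `have` is at least the matching one in `need`."""
    return all(h >= n for h, n in zip(have, need))


def match(pending, placeholders) -> dict:
    free = list(placeholders)
    assignments = {}
    # Handle unschedulable pods from the smallest request to the largest.
    for pod in sorted(pending, key=lambda p: p.request):
        # Candidates: placeholders whose specification is not lower than the pod's.
        candidates = sorted((h for h in free if covers(h.spec, pod.spec)),
                            key=lambda h: h.spec)
        for holder in candidates:
            # Confirm against actual resources before evicting the placeholder.
            if covers(holder.actual, pod.request):
                assignments[pod.name] = holder.name
                free.remove(holder)  # the placeholder is evicted
                break
    return assignments


placeholders = [
    Placeholder("ph-big", (2000, 2048), (2000, 2048)),
    Placeholder("ph-mid", (1000, 1024), (800, 1024)),
    Placeholder("ph-small", (500, 512), (500, 512)),
]
pending = [
    PendingPod("pod-b", request=(1000, 1024), spec=(1000, 1024)),
    PendingPod("pod-a", request=(400, 256), spec=(500, 512)),
]
result = match(pending, placeholders)
print(result)
```

In this data `ph-mid` passes the specification filter for `pod-b` but fails the actual-resource check, so the search moves on to the next candidate, mirroring the "if not" branch.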

For example, all unschedulable pods are sorted in descending order by the CPU value of their requested resources (CPU, memory), and an attempt is made to find a suitable placeholder container node for each of them so that the cluster can be expanded. The placeholder container nodes corresponding to the pod are first looked up by application label; there may be several. Starting from the placeholder container node template with the smallest specification, the method tests whether a hypothetical new node of that template could accommodate the current pod. If no template is usable, the pod is abandoned. If one is, the template's specification is added to the cluster state snapshot as a new node, and the method then calculates whether the remaining pods can also be placed. This loop repeats in an attempt to find a suitable placeholder container node for every unscheduled pod. If several placeholder container nodes can meet the need, the one that requires the fewest added nodes is selected.
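The snapshot-based estimate in this example resembles a bin-packing simulation, which can be sketched as follows. This is a simplification under stated assumptions: the `Template` class and all figures are hypothetical, first-fit decreasing is swapped in as the packing heuristic, and a template is considered only if it can hold every pending pod, whereas the text above abandons individual pods that fit no template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    name: str
    cpu: int     # millicores offered by one node of this template
    memory: int  # MiB offered by one node of this template


def nodes_needed(template: Template, pods):
    """Simulate adding nodes of one template; return (node count, abandoned pods)."""
    snapshot = []   # remaining [cpu, memory] of each simulated node
    abandoned = []
    # Largest CPU request first, as in the example above.
    for name, cpu, memory in sorted(pods, key=lambda p: p[1], reverse=True):
        if cpu > template.cpu or memory > template.memory:
            abandoned.append(name)  # too big for this template
            continue
        for node in snapshot:
            if node[0] >= cpu and node[1] >= memory:
                node[0] -= cpu
                node[1] -= memory
                break
        else:
            # No simulated node has room: add another node to the snapshot.
            snapshot.append([template.cpu - cpu, template.memory - memory])
    return len(snapshot), abandoned


def choose_template(templates, pods):
    """Pick the template that places all pods with the fewest added nodes."""
    best = None
    for template in sorted(templates, key=lambda t: (t.cpu, t.memory)):
        count, abandoned = nodes_needed(template, pods)
        if abandoned:
            continue
        if best is None or count < best[1]:
            best = (template, count)
    return best


templates = [Template("large", 8000, 16384), Template("medium", 4000, 8192)]
pods = [("p1", 3000, 4096), ("p2", 3000, 4096), ("p3", 2000, 2048)]
chosen = choose_template(templates, pods)
print(chosen)
```

With these figures the medium template needs three nodes while the large one needs a single node, so the large template is chosen; on a tie the smaller template wins because templates are tried in ascending order.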

In a container cluster, capacity is expanded by adding nodes, either to accommodate more applications or to provide more replicas of an application to cope with heavy traffic. The time needed to add nodes depends on the cloud provider; it is usually no less than 3 minutes and grows as the cluster gets larger. For ordinary applications, expanding the cluster by adding nodes when it cannot hold enough replicas may have little impact. For important applications that require timely response, however, the expansion time is often intolerable when replicas must be added but no host resources are available and the cluster has to be expanded first. By reserving resources in advance, the present invention always prepares more infrastructure resources for the cluster than current business needs require for important applications, reducing application blocking caused by insufficient infrastructure resources.

As shown in FIG. 2, the system 200 includes:

a placeholder creation unit 210, configured to create placeholder container nodes with the same specifications for each of multiple types of working container nodes, where the placeholder container nodes are in a dormant state and the types are two-dimensional classification results comprising a specification type and an application type;

an abnormality acquisition unit 220, configured to obtain the type of an abnormal working container node that cannot be scheduled, locate working container nodes of the same type according to that type, and obtain the specification parameters of the working container nodes of the same type;

a node expansion unit 230, configured to search all placeholder container nodes for a target placeholder container node whose specification is not lower than the specification parameters, and release the resources occupied by the target placeholder container node to the abnormal working container node.

Optionally, as an embodiment of the present invention, the system further includes:

Optionally, as an embodiment of the present invention, the abnormality acquisition unit includes:

Optionally, as an embodiment of the present invention, the node expansion unit includes:

图3为本发明实施例提供的一种终端300的结构示意图，该终端300可以用于执行本发明实施例提供的容器集群扩容方法。FIG3 is a schematic diagram of the structure of a terminal 300 provided in an embodiment of the present invention. The terminal 300 may be used to execute the container cluster expansion method provided in an embodiment of the present invention.

其中，该终端300可以包括：处理器310、存储器320及通信单元330。这些组件通过一条或多条总线进行通信，本领域技术人员可以理解，图中示出的服务器的结构并不构成对本发明的限定，它既可以是总线形结构，也可以是星型结构，还可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。The terminal 300 may include: a processor 310, a memory 320 and a communication unit 330. These components communicate via one or more buses. Those skilled in the art will appreciate that the server structure shown in the figure does not limit the present invention, and it may be a bus structure or a star structure, and may include more or fewer components than shown in the figure, or combine certain components, or arrange the components differently.

其中，该存储器320可以用于存储处理器310的执行指令，存储器320可以由任何类型的易失性或非易失性存储终端或者它们的组合实现，如静态随机存取存储器(SRAM)，电可擦除可编程只读存储器(EEPROM)，可擦除可编程只读存储器(EPROM)，可编程只读存储器(PROM)，只读存储器(ROM)，磁存储器，快闪存储器，磁盘或光盘。当存储器320中的执行指令由处理器310执行时，使得终端300能够执行以下上述方法实施例中的部分或全部步骤。The memory 320 can be used to store the execution instructions of the processor 310, and the memory 320 can be implemented by any type of volatile or non-volatile storage terminal or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disk. When the execution instructions in the memory 320 are executed by the processor 310, the terminal 300 can perform some or all of the steps in the following method embodiments.

处理器310为存储终端的控制中心，利用各种接口和线路连接整个电子终端的各个部分，通过运行或执行存储在存储器320内的软件程序和/或模块，以及调用存储在存储器内的数据，以执行电子终端的各种功能和/或处理数据。所述处理器可以由集成电路(Integrated Circuit，简称IC)组成，例如可以由单颗封装的IC所组成，也可以由连接多颗相同功能或不同功能的封装IC而组成。举例来说，处理器310可以仅包括中央处理器(Central Processing Unit，简称CPU)。在本发明实施方式中，CPU可以是单运算核心，也可以包括多运算核心。The processor 310 is the control center of the storage terminal, and uses various interfaces and lines to connect various parts of the entire electronic terminal. It runs or executes software programs and/or modules stored in the memory 320, and calls data stored in the memory to perform various functions of the electronic terminal and/or process data. The processor can be composed of an integrated circuit (IC), for example, it can be composed of a single packaged IC, or it can be composed of a plurality of packaged ICs with the same or different functions. For example, the processor 310 can include only a central processing unit (CPU). In the embodiment of the present invention, the CPU can be a single computing core or multiple computing cores.

The communication unit 330 is used to establish a communication channel so that the terminal can communicate with other terminals, receiving user data sent by other terminals or sending user data to them.

The present invention also provides a computer storage medium. The computer storage medium may store a program which, when executed, may perform some or all of the steps of the embodiments provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

Therefore, the present invention responds quickly when infrastructure resources run short because application resource demand grows in a k8s cluster, for example when a surge in business volume increases the number of application replicas or when new applications are added. This alleviates application pressure and ensures normal operation even as business pressure increases. For the technical effects achievable by this embodiment, see the description above; they are not repeated here.

Those skilled in the art will clearly understand that the technology in the embodiments of the present invention can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution in the embodiments of the present invention, in essence, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and includes several instructions that cause a computer terminal (which may be a personal computer, a server, a second terminal, a network terminal, etc.) to perform all or some of the steps of the methods described in the embodiments of the present invention.

For parts that are the same or similar across the embodiments in this specification, the embodiments may be referred to one another. In particular, because the terminal embodiment is essentially similar to the method embodiment, it is described relatively briefly; for the relevant details, see the description of the method embodiment.

In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods can be implemented in other ways. For example, the system embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, systems or units, and may be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or may be integrated two or more to a unit.

Although the present invention has been described in detail with reference to the accompanying drawings and in combination with the preferred embodiments, the present invention is not limited thereto. Without departing from the spirit and essence of the present invention, a person of ordinary skill in the art may make various equivalent modifications or substitutions to the embodiments of the present invention, and such modifications or substitutions shall fall within the scope of the present invention. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall likewise fall within its scope of protection. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims

1. A container cluster expansion method, characterized by comprising:

Creating placeholder container nodes with the same specifications for multiple types of working container nodes respectively, wherein the placeholder container nodes are in a dormant state, and the types are two-dimensional classification results comprising specification types and application types;

Obtaining the type of an abnormal working container node that cannot be scheduled, locating working container nodes of the same type according to the type of the abnormal working container node, and obtaining the specification parameters of the working container nodes of the same type;

Searching all placeholder container nodes for a target placeholder container node whose specification is not lower than the specification parameters, and releasing the resources occupied by the target placeholder container node to the abnormal working container node;

Wherein searching all placeholder container nodes for a target placeholder container node whose specification is not lower than the specification parameters, and releasing the resources occupied by the target placeholder container node to the abnormal working container node, comprises:

Sorting all abnormal working container nodes in the scheduling queue in ascending order of requested resources, and selecting a target abnormal working container node in that order;

Filtering, from all placeholder container nodes, a plurality of candidate placeholder container nodes whose specifications are not lower than the corresponding specification parameters of the target abnormal working container node;

Sorting the candidate placeholder container nodes in ascending order of specification, and selecting the first candidate as the placeholder container node to be matched;

Calculating the actual hardware resources of the placeholder container node to be matched, and determining whether the actual hardware resources are not less than the requested resource size of the target abnormal working container node;

If yes, migrating the target abnormal working container node onto the hardware resources of the placeholder container node to be matched, the placeholder container node to be matched being evicted by the higher-priority target abnormal working container node;

If not, selecting the next candidate placeholder container node in sequence as the placeholder container node to be matched.
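The matching steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names `Resources`, `Pending` and `Placeholder`, the function `assign_placeholders`, and the choice to order resources by the tuple (CPU, memory) are all assumptions made for illustration, since the claim does not specify how multi-dimensional specifications are compared.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Resources:
    """Hypothetical two-dimensional resource vector (CPU cores, memory in GiB)."""
    cpu: float
    memory_gib: float

    def covers(self, other: "Resources") -> bool:
        # "Not lower than": every dimension is at least as large.
        return self.cpu >= other.cpu and self.memory_gib >= other.memory_gib

    def sort_key(self) -> Tuple[float, float]:
        # Assumption: order by CPU first, then memory.
        return (self.cpu, self.memory_gib)


@dataclass
class Pending:
    """An abnormal (unschedulable) working container node."""
    name: str
    request: Resources      # resources it requests
    spec_param: Resources   # specification parameters of its same-type peers


@dataclass
class Placeholder:
    """A dormant placeholder container node."""
    name: str
    spec: Resources         # nominal specification
    actual: Resources       # actual hardware resources it holds


def assign_placeholders(pending: List[Pending],
                        placeholders: List[Placeholder]) -> Dict[str, str]:
    """Return a mapping from pending node name to the evicted placeholder name."""
    free = list(placeholders)
    assignment: Dict[str, str] = {}
    # Process pending nodes in ascending order of requested resources.
    for node in sorted(pending, key=lambda p: p.request.sort_key()):
        # Keep candidates whose specification is not lower than the spec parameters,
        # ordered from smallest to largest specification.
        candidates = sorted(
            (ph for ph in free if ph.spec.covers(node.spec_param)),
            key=lambda ph: ph.spec.sort_key(),
        )
        for ph in candidates:
            # Check actual hardware resources against the request.
            if ph.actual.covers(node.request):
                free.remove(ph)               # the placeholder is evicted
                assignment[node.name] = ph.name
                break                         # otherwise try the next candidate
    return assignment
```

Trying the smallest adequate placeholder first keeps larger placeholders available for larger requests later in the queue, which is presumably why both sorts are ascending.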

2. The method according to claim 1, characterized in that before creating placeholder container nodes with the same specifications for the multiple types of working container nodes respectively, the method further comprises:

Classifying the working container nodes according to specification size and marking the working container nodes with a specification type, where the specification refers to the hardware resources occupied by a working container node;

Classifying the working container nodes according to application division of labor and marking the working container nodes with an application type.
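The two-dimensional marking in claim 2 might look like the following sketch. The size thresholds, the bucket names `small`/`medium`/`large`, and the label keys `spec-type` and `app-type` are hypothetical; the claim only requires that each working container node carries a specification type and an application type.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WorkerPod:
    """Hypothetical working container node with its occupied hardware resources."""
    name: str
    cpu: float
    memory_gib: float
    app: str
    labels: Dict[str, str] = field(default_factory=dict)


def spec_type(cpu: float, memory_gib: float) -> str:
    """Bucket a node by occupied resources (thresholds are illustrative only)."""
    if cpu <= 1 and memory_gib <= 2:
        return "small"
    if cpu <= 4 and memory_gib <= 8:
        return "medium"
    return "large"


def label_worker(pod: WorkerPod) -> WorkerPod:
    """Mark the node with both dimensions of its type."""
    pod.labels["spec-type"] = spec_type(pod.cpu, pod.memory_gib)
    pod.labels["app-type"] = pod.app
    return pod
```

With labels of this kind in place, nodes of the same type can later be found by a plain equality match on both label values.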

3. The method according to claim 2, characterized in that obtaining the type of the abnormal working container node that cannot be scheduled, locating working container nodes of the same type according to the type of the abnormal working container node, and obtaining the specification parameters of the working container nodes of the same type comprises:

Monitoring the status of the working container nodes, and treating working container nodes that cannot be scheduled as abnormal working container nodes;

Acquiring the marking content of the abnormal working container node, and searching for working container nodes of the same type whose marking content is consistent with that of the abnormal working container node, wherein the marking content includes a specification type and an application type;

Calculating the hardware resource parameters occupied by the working container nodes of the same type, and outputting the hardware resource parameters as the specification parameters.
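The lookup in claim 3 can be sketched as below. The names `WorkerPod`, `find_abnormal` and `specification_parameters` are hypothetical, and so is the aggregation rule: the claim says the occupied hardware resource parameters are "calculated" without saying how, so this sketch assumes the per-dimension maximum over the same-type peers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class WorkerPod:
    """Hypothetical working container node with its two type marks."""
    name: str
    spec_type: str
    app_type: str
    cpu: float
    memory_gib: float
    schedulable: bool = True


def find_abnormal(workers: List[WorkerPod]) -> List[WorkerPod]:
    """Treat every node that cannot be scheduled as abnormal."""
    return [w for w in workers if not w.schedulable]


def specification_parameters(abnormal: WorkerPod,
                             workers: List[WorkerPod]) -> Optional[Tuple[float, float]]:
    """Return (cpu, memory_gib) derived from running peers of the same type.

    Returns None when no peer shares both the specification type and the
    application type of the abnormal node.
    """
    peers = [
        w for w in workers
        if w.schedulable
        and (w.spec_type, w.app_type) == (abnormal.spec_type, abnormal.app_type)
    ]
    if not peers:
        return None
    # Assumption: take the largest occupied value in each dimension.
    return (max(w.cpu for w in peers), max(w.memory_gib for w in peers))
```

Taking the maximum is the conservative choice, because a placeholder that covers the largest peer also covers every smaller one; an average or a declared request value would be equally consistent with the claim wording.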

4. A container cluster expansion system, characterized by comprising:

A placeholder creation unit, configured to create placeholder container nodes with the same specifications for multiple types of working container nodes respectively, wherein the placeholder container nodes are in a dormant state, and the types are two-dimensional classification results comprising specification types and application types;

An abnormality acquisition unit, configured to obtain the type of an abnormal working container node that cannot be scheduled, locate working container nodes of the same type according to the type of the abnormal working container node, and obtain the specification parameters of the working container nodes of the same type;

A node expansion unit, configured to search all placeholder container nodes for a target placeholder container node whose specification is not lower than the specification parameters, and release the resources occupied by the target placeholder container node to the abnormal working container node;

Wherein the node expansion unit includes:

A scheduling sorting module, configured to sort all abnormal working container nodes in the scheduling queue in ascending order of requested resources, and select a target abnormal working container node in that order;

A candidate screening module, configured to filter, from all placeholder container nodes, a plurality of candidate placeholder container nodes whose specifications are not lower than the corresponding specification parameters of the target abnormal working container node;

A placeholder sorting module, configured to sort the candidate placeholder container nodes in ascending order of specification, and select the first candidate as the placeholder container node to be matched;

A resource judgment module, configured to calculate the actual hardware resources of the placeholder container node to be matched, and judge whether the actual hardware resources are not less than the requested resource size of the target abnormal working container node;

A node migration module, configured to migrate the target abnormal working container node onto the hardware resources of the placeholder container node to be matched if the judgment result of the resource judgment module is yes, the placeholder container node to be matched being evicted by the higher-priority target abnormal working container node;

A target reselection module, configured to select the next candidate placeholder container node in sequence as the placeholder container node to be matched if the judgment result of the resource judgment module is no.
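The behavior of the placeholder creation unit can be illustrated with a small sketch. The names `Worker`, `PlaceholderPod` and `create_placeholders`, the naming scheme for placeholders, the one-placeholder-per-type policy, and the priority value `-1` are assumptions; the claim only requires a dormant placeholder with the same specification for each type, which a higher-priority working container node can later evict.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Worker:
    """Hypothetical working container node, already marked with its type."""
    name: str
    spec_type: str
    app_type: str
    cpu: float
    memory_gib: float


@dataclass(frozen=True)
class PlaceholderPod:
    """Dormant placeholder that reserves resources without doing work."""
    name: str
    spec_type: str
    app_type: str
    cpu: float
    memory_gib: float
    priority: int = -1      # assumption: lower than any working node
    dormant: bool = True


def create_placeholders(workers: List[Worker]) -> List[PlaceholderPod]:
    """Create one placeholder per distinct (specification type, application type)."""
    representative: Dict[Tuple[str, str], Worker] = {}
    for w in workers:
        # The first worker seen for a type supplies the specification to copy.
        representative.setdefault((w.spec_type, w.app_type), w)
    return [
        PlaceholderPod(
            name=f"placeholder-{spec}-{app}",
            spec_type=spec,
            app_type=app,
            cpu=w.cpu,
            memory_gib=w.memory_gib,
        )
        for (spec, app), w in sorted(representative.items())
    ]
```

In a Kubernetes deployment this role is commonly played by low-priority pods that run an idle container, so that the scheduler preempts them when real workloads need the capacity.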

5. The system according to claim 4, characterized in that the system further comprises:

A first classification module, configured to classify the working container nodes according to specification size and mark the working container nodes with a specification type, where the specification refers to the hardware resources occupied by a working container node;

A second classification module, configured to classify the working container nodes according to application division of labor and mark the working container nodes with an application type.

6. The system according to claim 5, characterized in that the abnormality acquisition unit comprises:

An abnormality monitoring module, configured to monitor the status of the working container nodes and treat working container nodes that cannot be scheduled as abnormal working container nodes;

A tag search module, configured to obtain the tag content of the abnormal working container node and search for working container nodes of the same type whose tag content is consistent with that of the abnormal working container node, wherein the tag content includes a specification type and an application type;

A specification acquisition module, configured to calculate the hardware resource parameters occupied by the working container nodes of the same type and output the hardware resource parameters as the specification parameters.

7. A terminal, comprising:

A processor;

A memory for storing execution instructions of the processor;

Wherein the processor is configured to execute the method according to any one of claims 1 to 3.

8. A computer-readable storage medium storing a computer program, wherein when the program is executed by a processor, the method according to any one of claims 1 to 3 is implemented.