CN119383087B

CN119383087B - A method and system for real-time monitoring and operation of horizontal network performance on a microservice cloud platform

Info

Publication number: CN119383087B
Application number: CN202411520209.0A
Authority: CN
Inventors: 钟掖; 张光益; 王策; 李洵; 汤杰; 丁群峰
Original assignee: Guizhou Power Grid Co Ltd
Current assignee: Guizhou Power Grid Co Ltd
Priority date: 2024-10-29
Filing date: 2024-10-29
Publication date: 2025-10-28
Anticipated expiration: 2044-10-29
Also published as: CN119383087A

Abstract

The invention discloses a method and a system for real-time monitoring, operation and maintenance of the transverse network performance of a micro-service cloud platform, and relates to the technical field of operation and maintenance. The method comprises: constructing an application service topology graph, a node topology graph and a host topology graph by collecting traffic information among container Pods, nodes and hosts in the micro-service cloud platform; performing network traffic index analysis and network fault alarm analysis based on the collected data, to help identify abnormal links and abnormal network load; and collecting network traffic information in real time through multiple collection components for virtual machine and container environments, so that link faults are automatically identified and alarmed. The method improves the accuracy of network fault diagnosis, provides data support for network optimization and resource allocation, shortens fault handling time, reduces the risk of service interruption caused by network problems, improves fault response speed, and ensures stable operation and service continuity of the micro-service cloud platform.

Description

Method and system for real-time monitoring, operation and maintenance of the transverse network performance of a micro-service cloud platform

Technical Field

The invention relates to the technical field of operation and maintenance, in particular to a method and a system for monitoring the transverse network performance of a micro-service cloud platform in real time.

Background

With the rapid development of information technology, operation and maintenance technology has become a key supporting capability that is profoundly changing how enterprise IT infrastructure is built and operated. On micro-service cloud platforms in particular, Kubernetes-based container orchestration systems have become key infrastructure supporting modern architectures. The wide adoption of micro-services has transformed service delivery, but it has also brought new challenges for network performance monitoring and for operation and maintenance. In a micro-service cloud platform, network traffic information between containers (Pods) and nodes (Nodes) is an important indicator of system health, and monitoring and analyzing this information in real time is essential to guaranteeing service quality and user experience.

However, existing network performance monitoring technology still shows many shortcomings when faced with the actual demands of a micro-service cloud platform. Traditional monitoring tools adapt poorly to the dynamics and complexity of a micro-service environment, so the topology graphs they construct are static, cannot be updated in real time, and cannot accurately reflect the real-time state of the micro-services; the timeliness and accuracy of the monitoring data are therefore low. Existing monitoring technology also lacks depth in its analysis of network traffic data, and its fault alarm systems have obvious deficiencies in accuracy and timeliness. As a result, network faults are difficult to identify and alarm effectively, fault location and handling are not timely enough, and service continuity and user experience suffer. Moreover, monitoring systems generally lack an automatic fault response mechanism and rely excessively on manual intervention, which greatly increases operation and maintenance cost and reduces fault handling efficiency. These shortcomings severely limit the service capability of the micro-service cloud platform and the problem-solving capability of its operation and maintenance team, and cannot satisfy the requirements of a modern cloud computing environment for efficient and intelligent operation and maintenance.

Disclosure of Invention

The present invention has been made in view of the above-described problems.

The technical problems solved by the invention are that existing operation and maintenance methods suffer from poor real-time performance in topology graph construction, insufficient traffic index analysis capability, and poor real-time performance of link fault identification and alarm mechanisms, together with the question of how to achieve comprehensive, real-time and automatic network performance monitoring, operation and maintenance.

To solve the above technical problems, the invention provides the following technical scheme: a method comprising the steps of constructing an application service topology graph, a node topology graph and a host topology graph by collecting traffic information among container Pods, nodes and hosts in a micro-service cloud platform; performing network traffic index analysis and network fault alarm analysis based on the collected data, to help identify abnormal links and abnormal network load; and collecting network traffic information in real time through multiple collection components for virtual machine and container environments, to automatically identify and alarm link faults.

As a preferred scheme of the method for real-time monitoring, operation and maintenance of the transverse network performance of the micro-service cloud platform, constructing the application service topology graph comprises displaying link traffic information among container Pods, including uplink and downlink rates, average response latency, throughput, traffic volume, packet loss rate and packet error rate, and identifying high-latency, high-load and abnormal service links through screening conditions.

As a preferred scheme of the method for real-time monitoring, operation and maintenance of the transverse network performance of the micro-service cloud platform, constructing the node topology graph comprises displaying the calling relationships among different nodes based on the application service topology graph and the deployment relationship between nodes and containers, analyzing the performance indices of node links through traffic data, and evaluating latency and packet loss through virtual network observation.

As a preferred scheme of the method for real-time monitoring, operation and maintenance of the transverse network performance of the micro-service cloud platform, constructing the host topology graph comprises displaying traffic information among hosts through the node topology and the relationship between nodes and hosts, which helps analyze network packet loss and latency at the physical link level, find abnormally loaded links, and provide supporting data for resource allocation.

The network traffic index analysis comprises traffic index ranking, distribution display and historical trend analysis, wherein the traffic ranking is displayed in the form of bar charts and pie charts, and the trend analysis displays traffic changes in the form of line charts and bar charts.

As a preferred scheme of the method for real-time monitoring, operation and maintenance of the transverse network performance of the micro-service cloud platform, identifying abnormal links and abnormal network load comprises the system counting and displaying network fault alarm information by the urgent, important and general alarm levels, and drilling down layer by layer from the physical network to the virtual network to help troubleshoot the fault points of network packet loss and latency.

As a preferred scheme of the method for real-time monitoring, operation and maintenance of the transverse network performance of the micro-service cloud platform, the automatic identification and alarm of link faults comprises the system collecting virtual machine network traffic through a Scapy collection component running on the KVM host, collecting the network traffic of Pods and nodes through a Cilium component running in the Kubernetes environment, and collecting traffic in Calico and globalRouter network environments.

Another object of the invention is to provide a system for real-time monitoring, operation and maintenance of the transverse network performance of a micro-service cloud platform, which helps identify abnormal links and abnormal network load by performing network traffic index analysis and network fault alarm analysis based on the collected data, and which solves the problems of insufficient network performance monitoring and untimely alarms in existing operation and maintenance technology.

The system comprises a topology construction module, a traffic index analysis and alarm module, and a link monitoring and automatic alarm module. The topology construction module is used for constructing an application service topology graph, a node topology graph and a host topology graph by collecting traffic information among container Pods, nodes and hosts in the micro-service cloud platform. The traffic index analysis and alarm module is used for performing network traffic index analysis and network fault alarm analysis based on the collected data, to help identify abnormal links and abnormal network load. The link monitoring and automatic alarm module is used for collecting network traffic information in real time through multiple collection components for virtual machine and container environments, and for automatically identifying and alarming link faults.

A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method for real-time monitoring, operation and maintenance of the transverse network performance of a micro-service cloud platform.

A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a method for real-time monitoring of micro-service cloud platform lateral network performance.

The beneficial effects of the invention are as follows. By collecting traffic information among container Pods, nodes and hosts in the micro-service cloud platform, the method constructs an application service topology graph, a node topology graph and a host topology graph, which realizes comprehensive visualization of the network structure of the micro-service cloud platform, provides operation and maintenance personnel with a clear view of the network architecture, improves the accuracy of network fault diagnosis, effectively reduces misoperation caused by an opaque network structure, and provides data support for network optimization and resource allocation. Based on the collected data, network traffic index analysis and network fault alarm analysis are performed to help identify abnormal links and abnormal network load, so that the system can analyze network performance in depth and the early detection rate of network faults is improved; alarm analysis helps locate problem links quickly, which shortens fault handling time and reduces the risk of service interruption caused by network problems. Network traffic information is collected in real time through multiple collection components for virtual machine and container environments, and link faults are automatically identified and alarmed, so that the system can respond to link faults rapidly, fault response speed is improved, and the stable operation and service continuity of the micro-service cloud platform are ensured.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is an overall flowchart of a method for monitoring performance of a micro service cloud platform in real time according to a first embodiment of the present invention.

Fig. 2 is an overall flowchart of a micro service cloud platform transverse network performance real-time monitoring operation and maintenance system according to a third embodiment of the present invention.

Detailed Description

So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

Embodiment 1: Referring to Fig. 1, this embodiment of the invention provides a method for real-time monitoring, operation and maintenance of the transverse network performance of a micro-service cloud platform, which comprises the following steps:

S1, constructing an application service topology graph, a node topology graph and a host topology graph by collecting traffic information among container Pods, nodes and hosts in the micro-service cloud platform.

Further, constructing the application service topology graph comprises displaying link traffic information among container Pods, including uplink and downlink rates, average response latency, throughput, traffic volume, packet loss rate and packet error rate, and identifying high-latency, high-load and abnormal service links through screening conditions.
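The screening step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `PodLink` record, the function names and all threshold values are assumptions chosen for the example, since the text does not specify the screening conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodLink:
    """One directed Pod-to-Pod link with the indices named in the text (hypothetical schema)."""
    src_pod: str
    dst_pod: str
    avg_latency_ms: float
    throughput_mbps: float
    packet_loss_rate: float   # fraction in [0, 1]
    packet_error_rate: float  # fraction in [0, 1]


def build_service_topology(links):
    """Return an adjacency map: src_pod -> {dst_pod: PodLink}."""
    topology = {}
    for link in links:
        topology.setdefault(link.src_pod, {})[link.dst_pod] = link
    return topology


def screen_links(topology, max_latency_ms=100.0, max_throughput_mbps=800.0, max_bad_rate=0.01):
    """Flag links that match the screening conditions (thresholds are assumed values)."""
    flagged = {}
    for neighbours in topology.values():
        for link in neighbours.values():
            reasons = []
            if link.avg_latency_ms > max_latency_ms:
                reasons.append("high-latency")
            if link.throughput_mbps > max_throughput_mbps:
                reasons.append("high-load")
            if link.packet_loss_rate > max_bad_rate or link.packet_error_rate > max_bad_rate:
                reasons.append("abnormal")
            if reasons:
                flagged[(link.src_pod, link.dst_pod)] = reasons
    return flagged


links = [
    PodLink("web-0", "orders-0", 12.0, 40.0, 0.0, 0.0),
    PodLink("orders-0", "db-0", 250.0, 35.0, 0.0, 0.0),
    PodLink("web-0", "cache-0", 3.0, 950.0, 0.05, 0.0),
]
topology = build_service_topology(links)
flagged = screen_links(topology)
```

In this sample, the `orders-0` to `db-0` link is flagged for latency and the `web-0` to `cache-0` link for both load and packet loss, while the healthy link is left out of the result.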

It should be noted that constructing the node topology graph comprises displaying the calling relationships among different nodes based on the application service topology graph and the deployment relationship between nodes and containers, analyzing the performance indices of node links through traffic data, and evaluating latency and packet loss through virtual network observation.
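One way to derive node links from Pod links and a deployment mapping is a simple roll-up, sketched below. The tuple layout, the `placement` mapping and the choice of a byte-weighted average for latency are assumptions made for illustration; the text does not state how per-Pod indices are combined.

```python
from collections import defaultdict


def roll_up(pod_links, placement):
    """Aggregate Pod-to-Pod links into node-to-node links.

    pod_links: iterable of (src_pod, dst_pod, byte_count, latency_ms)
    placement: dict mapping each Pod name to the node it is deployed on
    Returns {(src_node, dst_node): {"bytes": int, "avg_latency_ms": float}}.
    """
    totals = defaultdict(lambda: {"bytes": 0, "weighted_latency": 0.0})
    for src_pod, dst_pod, byte_count, latency_ms in pod_links:
        src_node, dst_node = placement[src_pod], placement[dst_pod]
        if src_node == dst_node:
            continue  # traffic that stays on one node is not a node-to-node link
        entry = totals[(src_node, dst_node)]
        entry["bytes"] += byte_count
        entry["weighted_latency"] += latency_ms * byte_count
    return {
        edge: {
            "bytes": entry["bytes"],
            # Byte-weighted mean so that busy Pod links dominate the node-level figure.
            "avg_latency_ms": entry["weighted_latency"] / entry["bytes"],
        }
        for edge, entry in totals.items()
        if entry["bytes"] > 0
    }


placement = {"web-0": "node-a", "orders-0": "node-b", "orders-1": "node-b", "db-0": "node-b"}
pod_links = [
    ("web-0", "orders-0", 1000, 10.0),
    ("web-0", "orders-1", 3000, 30.0),
    ("orders-0", "db-0", 500, 2.0),  # both Pods are on node-b, so this link is skipped
]
node_links = roll_up(pod_links, placement)
```

Because the function only needs a "child to parent" mapping, the same roll-up could in principle be reused with a node-to-host mapping.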

It should also be noted that constructing the host topology graph comprises displaying traffic information among hosts through the node topology and the relationship between nodes and hosts, which helps analyze network packet loss and latency at the physical link level, find abnormally loaded links, and provide supporting data for resource allocation.
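Finding abnormally loaded host links can be illustrated by comparing observed traffic with link capacity. The sketch below assumes that a nominal capacity is known for the physical links and that 80% utilization counts as abnormal; neither value comes from the text, and the function name is hypothetical.

```python
def find_overloaded_host_links(host_links, capacity_bps, utilization_threshold=0.8):
    """Return (src_host, dst_host, utilization) for links at or above the threshold.

    host_links: dict mapping (src_host, dst_host) to observed bits per second
    capacity_bps: assumed nominal capacity shared by all physical links
    """
    report = []
    for (src_host, dst_host), observed_bps in host_links.items():
        utilization = observed_bps / capacity_bps
        if utilization >= utilization_threshold:
            report.append((src_host, dst_host, round(utilization, 2)))
    # Most heavily loaded links first, which is the order a resource planner would read them in.
    return sorted(report, key=lambda row: row[2], reverse=True)


host_links = {
    ("host-1", "host-2"): 9_200_000_000,
    ("host-1", "host-3"): 1_500_000_000,
    ("host-2", "host-3"): 8_500_000_000,
}
overloaded = find_overloaded_host_links(host_links, capacity_bps=10_000_000_000)
```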

It should also be noted that traffic information between Pods is collected, and a topology graph of the logical relationships among micro-services is constructed by analyzing the traffic data. The application service topology graph intuitively displays the calling relationships among the micro-services of an application and, at the same time, displays link traffic information between Pods, including uplink and downlink rates, average response latency, throughput, traffic volume, packet loss rate and packet error rate. Service links with high latency, high load or abnormalities can be screened out by condition.

Through virtual network observation in the cloud environment, and based on the application service topology graph and the deployment relationship between nodes and micro-services (Pods), the node topology graph of the application can be drawn automatically. The calling relationships among the load nodes of the application can be seen clearly in the node topology graph. By analyzing the traffic data between nodes, the performance indices of node links are evaluated, and index data are displayed for the key positions, namely the client, client container node, client host, client gateway, server gateway host, server container node and server, so that latency and packet loss can be located on the end-to-end path of the virtual network. Node links with high latency, high load or abnormalities are screened out by condition.

Through the node topology and the relationship between nodes and hosts, the host topology of the application is drawn automatically. Traffic information between hosts can be observed in the host topology graph, so that network traffic is analyzed at the physical link level, the locations of network packet loss and latency are analyzed quickly, links with abnormal load or latency are found, and data support is provided for the reasonable allocation of resources.
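Locating latency on the end-to-end path can be illustrated by comparing measurements taken at consecutive key positions. The sketch below assumes that each observation point reports the cumulative latency since the request left the client; that measurement model, the function name and the sample values are hypothetical, and only the list of positions is taken from the text.

```python
OBSERVATION_POINTS = [
    "client",
    "client container node",
    "client host",
    "client gateway",
    "server gateway host",
    "server container node",
    "server",
]


def locate_worst_segment(cumulative_latency_ms):
    """Return (from_point, to_point, added_latency_ms) for the slowest path segment.

    cumulative_latency_ms: list of (point_name, latency_ms) in path order, where
    latency_ms is the time elapsed since the request left the client.
    """
    worst = None
    for (point_a, t_a), (point_b, t_b) in zip(cumulative_latency_ms, cumulative_latency_ms[1:]):
        added = t_b - t_a
        if worst is None or added > worst[2]:
            worst = (point_a, point_b, added)
    return worst


# Illustrative readings: most of the delay appears between the two gateways.
readings = list(zip(OBSERVATION_POINTS, [0.0, 0.25, 0.5, 1.0, 9.0, 9.5, 10.0]))
worst = locate_worst_segment(readings)
```

The same comparison could be applied to packet counters at each position to find the segment where packets are lost.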

It should also be noted that, by collecting traffic information among container Pods, nodes and hosts in the micro-service cloud platform, the application service topology graph, node topology graph and host topology graph are constructed, which realizes comprehensive visualization of the network structure of the micro-service cloud platform, provides operation and maintenance personnel with a clear view of the network architecture, improves the accuracy of network fault diagnosis, effectively reduces misoperation caused by an opaque network structure, and provides data support for network optimization and resource allocation.

S2, based on the collected data, the system performs network traffic index analysis and network fault alarm analysis to help identify abnormal links and abnormal network load.

Further, the network traffic index analysis comprises traffic index ranking, distribution display and historical trend analysis, wherein the traffic ranking is displayed in the form of a bar graph and a pie chart, and the trend analysis displays traffic change conditions in the form of a line graph and a bar graph.
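The ranking and distribution views reduce to two small computations: a sorted Top-N list for a bar chart and per-item shares for a pie chart. The following sketch uses hypothetical function names and made-up byte counts, and leaves the actual chart rendering out.

```python
def rank_traffic(bytes_by_service, top_n=3):
    """Return the top_n (service, bytes) pairs, largest first, as input for a bar chart."""
    ranked = sorted(bytes_by_service.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def traffic_shares(bytes_by_service):
    """Return each service's fraction of total traffic, as input for a pie chart."""
    total = sum(bytes_by_service.values())
    if total == 0:
        return {name: 0.0 for name in bytes_by_service}
    return {name: count / total for name, count in bytes_by_service.items()}


bytes_by_service = {"web": 500, "orders": 300, "db": 150, "cache": 50}
top3 = rank_traffic(bytes_by_service)
shares = traffic_shares(bytes_by_service)
```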

It should be noted that identifying abnormal links and abnormal network load comprises the system counting and displaying network fault alarm information by the urgent, important and general alarm levels, and drilling down layer by layer from the physical network to the virtual network to help troubleshoot the fault points of network packet loss and latency.
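Counting alarms by level can be sketched as a classification followed by a tally. The three level names come from the text; the rule that maps packet loss and latency to a level, including every threshold, is an assumption made only so that the example runs.

```python
LEVELS = ("urgent", "important", "general")


def classify_alarm(loss_rate, latency_ms):
    """Map link indices to an alarm level, or None if healthy (thresholds are assumed)."""
    if loss_rate >= 0.05 or latency_ms >= 500.0:
        return "urgent"
    if loss_rate >= 0.01 or latency_ms >= 200.0:
        return "important"
    if loss_rate > 0.0 or latency_ms >= 100.0:
        return "general"
    return None


def count_alarms_by_level(link_metrics):
    """Tally alarms per level; link_metrics maps link name to (loss_rate, latency_ms)."""
    counts = {level: 0 for level in LEVELS}
    for loss_rate, latency_ms in link_metrics.values():
        level = classify_alarm(loss_rate, latency_ms)
        if level is not None:
            counts[level] += 1
    return counts


link_metrics = {
    "a->b": (0.10, 20.0),
    "b->c": (0.0, 250.0),
    "c->d": (0.0, 120.0),
    "d->e": (0.0, 10.0),
    "e->f": (0.02, 600.0),
}
alarm_counts = count_alarms_by_level(link_metrics)
```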

It should be further noted that physical links and virtual links are analyzed from the different dimensions of application service (Pod container), node (Node) and host. Normal, abnormal and high-latency links are counted, and the detailed link list can be viewed by drilling down. A master list of network links is provided, which intuitively displays the performance information of every physical link and virtual link, including network rate, latency, packet loss, throughput and traffic volume, and which can be sorted by different indices. Network traffic index analysis is likewise performed from the dimensions of application service (Pod container), node (Node) and host. Ranking of network traffic index data displays traffic access relationships and index data in ranked form, including bar charts, pie charts, Top-N bar charts, stacked Top-N bar charts and tables. Distribution display of network traffic index data displays traffic access relationships and index data as distributions, including probability distribution charts and geographic-location bar charts. Trend analysis of network traffic index data displays traffic access relationships and index data over historical time, including stacked charts and bar charts.
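Historical trend analysis needs traffic samples grouped into fixed time windows before they can be drawn as a line or bar chart. A minimal sketch of that grouping is shown below; the sample format and the 60-second window are assumptions, not details from the text.

```python
from collections import defaultdict


def bucket_trend(samples, bucket_seconds=60):
    """Sum traffic per time window.

    samples: iterable of (unix_timestamp, byte_count), in any order
    Returns a list of (window_start, total_bytes) sorted by time.
    """
    totals = defaultdict(int)
    for timestamp, byte_count in samples:
        window_start = int(timestamp) - int(timestamp) % bucket_seconds
        totals[window_start] += byte_count
    return sorted(totals.items())


samples = [(0, 100), (30, 200), (60, 50), (125, 400), (119, 25)]
trend = bucket_trend(samples)
```

Each resulting pair is one point on the trend line; windows with no samples are simply absent and would need to be filled with zeros if a continuous axis is required.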

It should also be noted that, based on the collected data, the system performs network traffic index analysis and network fault alarm analysis to help identify abnormal links and abnormal network load. This allows the system to analyze network performance in depth and improves the early detection rate of network faults, while alarm analysis helps the operation and maintenance team locate problem links quickly, thereby shortening fault handling time and reducing the risk of service interruption caused by network problems.

S3, collecting network traffic information in real time through multiple collection components for virtual machine and container environments, and automatically identifying and alarming link faults.

Further, the automatic identification and alarm of link faults comprises the system collecting virtual machine network traffic through a Scapy collection component running on the KVM host, collecting the network traffic of Pods and nodes through a Cilium component running in the Kubernetes environment, and collecting traffic in Calico and globalRouter network environments.
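Because traffic arrives from several collection components, a common record format and a per-environment registry are a natural way to combine them. The sketch below shows only that structure: the two collector functions are stand-ins that return fixed data, they do not call Scapy or Cilium, and every name in the block is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowSample:
    """Common record emitted by every collector (hypothetical schema)."""
    source: str      # environment the sample came from
    src: str
    dst: str
    byte_count: int


def fake_vm_collector():
    """Stand-in for a packet-capture collector on a KVM host; returns canned data."""
    return [("vm-1", "vm-2", 1200)]


def fake_k8s_collector():
    """Stand-in for a collector in a Kubernetes environment; returns canned data."""
    return [("pod-a", "pod-b", 800), ("node-1", "node-2", 300)]


# Registry keyed by environment, so further collectors can be added without changing collect_all.
COLLECTORS = {"kvm": fake_vm_collector, "kubernetes": fake_k8s_collector}


def collect_all(collectors):
    """Run every registered collector and normalize its output into FlowSample records."""
    samples = []
    for environment, collect in collectors.items():
        for src, dst, byte_count in collect():
            samples.append(FlowSample(environment, src, dst, byte_count))
    return samples


samples = collect_all(COLLECTORS)
```

A real deployment would replace the stand-ins with adapters around the actual collection components; the downstream topology and alarm logic would then only depend on `FlowSample`.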

It should be noted that links are constructed automatically: the link information collected by the system is identified, analyzed and automatically managed by the system's data processing platform, and relationship links are generated automatically. Virtual machine traffic collection: a Scapy collection component runs on the virtual machines of the KVM host and collects the network traffic of the virtual machines. Host traffic collection: a Scapy collection component runs on the hosts and cloud virtual machines and collects their network traffic. K8s traffic collection: a Cilium collection component is installed on the K8s containers and collects the traffic of the Pod containers and nodes (Nodes) on K8s, and container network traffic is collected in Calico and globalRouter network environments.

It should also be noted that network fault alarm information is counted and displayed graphically according to the urgent, important and general alarm levels, and detailed alarm information can be viewed by drilling down. Physical links and virtual links are checked in a unified manner, layer by layer, from the source physical network and physical servers to the virtual network (OVS), virtual machines and containers, and the fault point is located using the packet loss and latency data of the physical network, the virtual network and the virtual network elements.
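The layer-by-layer check can be sketched as a walk from the physical network down to the container that stops at the first layer whose indices are out of bounds. The layer order follows the text; treating the first degraded layer as the likely fault point, and the threshold values, are assumptions of this example.

```python
LAYERS = [
    "physical network",
    "physical server",
    "virtual network (OVS)",
    "virtual machine",
    "container",
]


def drill_down(layer_metrics, max_loss_rate=0.01, max_latency_ms=50.0):
    """Return (layer, metrics) for the first layer exceeding a threshold, or None.

    layer_metrics: dict mapping layer name to {"loss_rate": float, "latency_ms": float}.
    Layers without data are skipped rather than treated as faulty.
    """
    for layer in LAYERS:
        metrics = layer_metrics.get(layer)
        if metrics is None:
            continue
        if metrics["loss_rate"] > max_loss_rate or metrics["latency_ms"] > max_latency_ms:
            return layer, metrics
    return None


# Loss first appears at the OVS layer and is then also visible in the layers below it.
layer_metrics = {
    "physical network": {"loss_rate": 0.0, "latency_ms": 1.0},
    "physical server": {"loss_rate": 0.0, "latency_ms": 2.0},
    "virtual network (OVS)": {"loss_rate": 0.08, "latency_ms": 3.0},
    "virtual machine": {"loss_rate": 0.08, "latency_ms": 4.0},
    "container": {"loss_rate": 0.08, "latency_ms": 5.0},
}
fault = drill_down(layer_metrics)
```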

It should also be noted that network traffic information is collected in real time through the multiple collection components for virtual machine and container environments, and link faults are automatically identified and alarmed. The system can therefore respond to link faults rapidly, and the automatic alarm can locate fault points quickly, which greatly improves fault response speed and ensures the stable operation and service continuity of the micro-service cloud platform.

Embodiment 2: This embodiment of the invention provides a method for real-time monitoring, operation and maintenance of the transverse network performance of a micro-service cloud platform. In order to verify the beneficial effects of the invention, a scientific demonstration is carried out through economic benefit calculation and simulation experiments.

First, the Scapy and Cilium collection components are used to collect, in real time, the network traffic among Pods, nodes and hosts in a Kubernetes cluster. Data collection covers the uplink and downlink rates, average response latency, throughput, traffic volume, packet loss rate and packet error rate between Pods. Based on the collected data, the data processing platform automatically constructs the application service topology graph, the node topology graph and the host topology graph, which show in detail the calling relationships among services, the interactions among nodes and the traffic information of the hosts. The collected data are then analyzed, including ranking, distribution display and historical trend analysis of the traffic indices, to help identify high-latency, high-load and abnormal links in the network. Network faults are alarmed in real time according to the preset alarm levels, which comprise the urgent, important and general levels. The experimental results show that the method can monitor the network effectively and that network faults are found and diagnosed in a timely manner.

Embodiment 3: Referring to Fig. 2, this embodiment of the invention provides a system for real-time monitoring, operation and maintenance of the transverse network performance of a micro-service cloud platform, which comprises a topology construction module, a traffic index analysis and alarm module, and a link monitoring and automatic alarm module.

The topology construction module is used for constructing an application service topology graph, a node topology graph and a host topology graph by collecting traffic information among container Pods, nodes and hosts in the micro-service cloud platform. The traffic index analysis and alarm module is used for performing network traffic index analysis and network fault alarm analysis based on the collected data, to help identify abnormal links and abnormal network load. The link monitoring and automatic alarm module is used for collecting network traffic information in real time through multiple collection components for virtual machine and container environments, and for automatically identifying and alarming link faults.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and various other media in which program code can be stored.

Logic and/or steps represented in the flowcharts or otherwise described herein, which may be regarded, for example, as an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the computer-readable medium include an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps or methods may be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like. It should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications are intended to be covered by the scope of the claims of the present invention.


Claims

1. A method for real-time monitoring, operation and maintenance of the transverse network performance of a micro-service cloud platform, characterized by comprising the following steps:

constructing an application service topology graph, a node topology graph and a host topology graph by collecting traffic information among container Pods, nodes and hosts in the micro-service cloud platform;

wherein constructing the application service topology graph comprises displaying link traffic information among container Pods, including uplink and downlink rates, average response latency, throughput, traffic volume, packet loss rate and packet error rate, and identifying high-latency, high-load and abnormal service links through screening conditions;

constructing the node topology graph comprises displaying the calling relationships among different nodes based on the application service topology graph and the deployment relationship between nodes and containers, analyzing the performance indices of node links through traffic data, and evaluating latency and packet loss through virtual network observation;

constructing the host topology graph comprises displaying traffic information among hosts through the node topology and the relationship between nodes and hosts, helping to analyze network packet loss and latency at the physical link level, finding abnormally loaded links, and providing supporting data for resource allocation;

based on the collected data, the system performs network traffic index analysis and network fault alarm analysis to help identify abnormal links and abnormal network load;

the network traffic index analysis comprises traffic index ranking, distribution display and historical trend analysis, wherein the traffic ranking is displayed in the form of bar charts and pie charts, and the trend analysis displays traffic changes in the form of line charts and bar charts;

network traffic information is collected in real time through multiple collection components for virtual machine and container environments, and link faults are automatically identified and alarmed;

the automatic identification and alarm of link faults comprises the system collecting virtual machine network traffic through a Scapy collection component running on the KVM host, collecting the network traffic of Pods and nodes through a Cilium component running in the Kubernetes environment, and collecting traffic in Calico and globalRouter network environments.

2. The method for real-time monitoring, operation and maintenance of the transverse network performance of a micro-service cloud platform according to claim 1, wherein identifying abnormal links and abnormal network load comprises the system counting and displaying network fault alarm information by the urgent, important and general alarm levels, and drilling down layer by layer from the physical network to the virtual network to help troubleshoot the fault points of network packet loss and latency.

3. A system adopting the method for real-time monitoring, operation and maintenance of the transverse network performance of a micro-service cloud platform according to any one of claims 1-2, characterized by comprising a topology construction module, a traffic index analysis and alarm module, and a link monitoring and automatic alarm module;

the topology construction module is used for constructing an application service topology graph, a node topology graph and a host topology graph by collecting traffic information among container Pods, nodes and hosts in the micro-service cloud platform;

the traffic index analysis and alarm module is used for performing network traffic index analysis and network fault alarm analysis based on the collected data, to help identify abnormal links and abnormal network load;

the link monitoring and automatic alarm module is used for collecting network traffic information in real time through multiple collection components for virtual machine and container environments, and for automatically identifying and alarming link faults.

4. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the micro service cloud platform lateral network performance real-time monitoring operation and maintenance method of any of claims 1 to 2.

5. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the micro service cloud platform lateral network performance real time monitoring operation and maintenance method of any of claims 1 to 2.