
CN120085885A - A method for updating an operating system based on cloud services - Google Patents


Info

Publication number
CN120085885A
Authority
CN
China
Prior art keywords
log
patch
information
abnormal
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411992030.5A
Other languages
Chinese (zh)
Inventor
韦宗星
欧东
吴珍兴
潘杨华
李新彰
吴大明
李滔滔
杨昌富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Gongcheng Yunwang Digital Intelligence Industry Development Co ltd
Original Assignee
Guizhou Gongcheng Yunwang Digital Intelligence Industry Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Gongcheng Yunwang Digital Intelligence Industry Development Co ltd
Priority to CN202411992030.5A
Publication of CN120085885A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/60 Software deployment
    • G06F8/65 Updates
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45591 Monitoring or debugging support

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides an operating system updating method based on cloud services. The method comprises: acquiring the operating system versions and patch installation conditions of different hosts by scanning system information of cloud hosts, dividing the hosts into corresponding management groups according to operating system type and version number, and formulating a unified patch strategy and update plan for each group; deploying monitoring agents on the cloud hosts, collecting operation indexes and log information of each service process in real time, aggregating the collected data into a centralized log analysis platform, and analyzing and mining the logs through a batch computation framework to construct a service operation behavior model; and training an anomaly detection model on the data in the log analysis platform using a machine learning algorithm in a supervised learning mode, wherein the model can automatically discover various abnormal behaviors in the service operation process from massive log data and generate alarm information.

Description

Cloud service-based operating system updating method
Technical Field
The invention relates to the technical field of information, in particular to an operating system updating method based on cloud services.
Background
Problem background:
In the data center of the government network, the operating system versions of the cloud hosts are uneven: some run Windows Server 2008, some Windows Server 2012, and some Windows Server 2016. The patches required by different operating systems are not identical. How to build a unified patch management mechanism so that operating systems of different versions obtain the latest security patches in time is a difficult problem for operation and maintenance personnel.
Furthermore, because the network environment of the government network is relatively closed, the cloud hosts cannot directly connect to Microsoft's official update server. Therefore, it is urgent to build a mirror update server inside the government network. However, the update files required by different operating system versions differ considerably, and building a set of automatically classified, intelligently identified mirror sources so that different operating systems can find their corresponding patch files requires operation and maintenance personnel to study the internal structure of the operating systems deeply and find the commonalities and differences among the various versions.
Meanwhile, the various services running on the cloud hosts also need real-time monitoring by operation and maintenance personnel. Different services have different log formats and operation characteristics, and operation and maintenance personnel need to develop an intelligent log analysis system that can accurately discover abnormal behaviors of various services and give early warning in time. This in turn requires the operator to have a deep understanding of the business logic of the various services, knowing which behaviors are normal and which are abnormal. In summary, in the operation and maintenance work of the government network, operation and maintenance personnel need to understand the technical details of the operating systems and services deeply, find commonalities across heterogeneous environments, and establish an automated, intelligent operation and maintenance mechanism to safeguard the safe and stable operation of the government network.
Disclosure of Invention
The invention provides an operating system updating method based on cloud service, which mainly comprises the following steps:
Acquiring operating system versions and patch installation conditions of different hosts by scanning system information of cloud hosts, dividing the hosts into corresponding management groups according to operating system type and version number, and formulating a unified patch strategy and an update plan for each group;
Setting up a mirror image server in the government network, downloading security update files required by versions of each operating system from a trusted patch source, analyzing key attributes such as applicable systems, version numbers and the like of patches by extracting metadata information of update packages, and automatically classifying the patch files into corresponding catalogues by adopting a clustering algorithm of machine learning;
Aiming at patch files in a mirror image server, key features are extracted, an intelligent classification model is trained, when a new patch file is added, the model can accurately identify the application range of the patch, and the patch file is automatically placed in a corresponding directory, so that different operating system versions can find matched updated files;
Deploying a monitoring agent on a cloud host, collecting running indexes and log information of each service process in real time, collecting collected data to a centralized log analysis platform, analyzing the logs and mining the data through a batch calculation framework, and constructing a behavior model of service running;
Training an anomaly detection model on the data in the log analysis platform using a machine learning algorithm in a supervised learning mode, wherein the model can automatically discover various abnormal behaviors in the service running process from massive log data and generate alarm information;
When the monitoring system finds that a certain service is abnormal, automatically notifying the operation and maintenance personnel of the abnormal information according to a pre-configured alarm rule, triggering an automatic emergency handling process, and adopting different handling measures, such as restarting a service process or rolling back a patch version, according to the severity and influence range of the abnormality;
In the operation and maintenance knowledge base, the configuration parameters, log formats, common faults and other information of different operation systems and services are summarized and generalized to form a set of standardized operation and maintenance specifications and processes, and the knowledge is associated and inferred through knowledge graph technology to form an intelligent operation and maintenance decision support system, so that diagnosis and treatment suggestions are provided for operation and maintenance personnel.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
The invention discloses an intelligent cloud host patch management and anomaly detection method. By scanning the system information of the cloud hosts, the hosts are grouped and managed according to operating system type and version, and a unified patch strategy is formulated. A mirror server is set up in the government network, and a machine learning clustering algorithm automatically classifies the patch files. A deployed monitoring agent collects service operation data, and a machine-learning-trained anomaly detection model finds anomalies in massive logs and raises alarms. An operation and maintenance decision support system is constructed in combination with knowledge graph technology to provide diagnosis suggestions for operation and maintenance personnel. The method realizes intelligent classification and automatic updating of cloud host patches as well as real-time monitoring and rapid handling of service anomalies, improving the security and stability of the cloud platform and reducing operation and maintenance costs.
Drawings
Fig. 1 is a flowchart of an operating system updating method based on cloud service according to the present invention.
Fig. 2 is a schematic diagram of an operating system updating method based on cloud services according to the present invention.
Fig. 3 is a schematic diagram of a cloud service-based operating system updating method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
As shown in Figs. 1-3, the cloud service-based operating system updating method in this embodiment may specifically include:
S101, acquiring operating system versions and patch installation conditions of different hosts by scanning system information of cloud hosts, dividing the hosts into corresponding management groups according to operating system types and version numbers, and making a unified patch strategy and an update plan for each group.
The method remotely connects to the cloud host through an API (application programming interface) or the SSH (secure shell) protocol, executes system commands, and acquires system information such as the operating system type, version number, and installed patch list of the cloud host. According to the acquired cloud host system information, a decision tree algorithm, and preset operating system type and version number rules, the management group of each cloud host is determined. In the configuration management database, the field of the management group to which the cloud host belongs is updated according to the unique identifier of the cloud host, dividing the cloud hosts into corresponding groups. For each management group, operation and maintenance personnel formulate a unified system patch strategy according to the characteristics of the group's operating system, determining the patch list to be installed and the installation priority. The patch strategy is converted into an executable script, and automated operation and maintenance tools such as Ansible are adopted to issue patch installation tasks to all cloud hosts in the group in batches. During patch installation, the installation progress and result of each cloud host are obtained in real time and recorded in the system log to facilitate tracking and auditing. After patch installation, the system information of the cloud hosts is scanned again to obtain the latest patch installation condition and judge whether it is consistent with the patch strategy; if not, a differential update is performed to ensure that the cloud hosts in each group meet the requirements of the patch strategy.
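The scan-result parsing described above can be sketched as follows (a minimal illustration; the hostname, command output, and package names are hypothetical, and a real deployment would collect them over SSH rather than hard-code them):

```python
def parse_scan_output(hostname, uname_output, package_lines):
    """Build a host record from raw scan output: the text returned by
    `uname -a` plus installed-package lines from `rpm -qa`."""
    fields = uname_output.split()
    return {
        "host": hostname,
        "os_type": fields[0] if fields else "unknown",            # e.g. "Linux"
        "kernel": fields[2] if len(fields) > 2 else "unknown",    # kernel release
        "packages": set(package_lines),                           # for patch diffing
    }

record = parse_scan_output(
    "web01",
    "Linux web01 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 GNU/Linux",
    ["kernel-3.10.0-957.el7", "openssl-1.0.2k-16.el7"],
)
```

The resulting record carries exactly the fields the grouping and differential-update steps need.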
Illustratively, remotely connecting to a cloud host to acquire system information is the basis of cloud platform management. Taking a cloud platform as an example, the virtual machine list can be obtained through RESTful API calls, and then each virtual machine can be logged into via the SSH protocol to execute system commands. Common commands include "uname -a" to get the operating system type and version, and "rpm -qa" or "dpkg -l" to list installed packages. This information can be used for subsequent grouping and patch management. A decision tree algorithm can efficiently group cloud hosts. For example, the operating system type (Windows/Linux) may be determined first, then the specific distribution (CentOS/Ubuntu, etc.), and finally the major version number (e.g., CentOS 6/7). This hierarchical decision can quickly partition the cloud hosts into appropriate management groups. The decision tree has the advantages of clear rules and high execution efficiency, and makes subsequent maintenance and adjustment of grouping strategies convenient. A Configuration Management Database (CMDB) is the core of IT asset management. In the CMDB, each cloud host has a unique identifier, such as an asset number or instance ID. When the group to which a cloud host belongs changes, only the management group field of the record located by that identifier needs to be modified. This centralized management approach ensures consistency and traceability of asset information. Several factors are considered when formulating the patch policy. Taking a Linux system as an example, the patching policy for CentOS 7 may include installing all security updates, installing kernel patches of a specific version, and excluding certain patches that may affect services. The policy should also consider installation order, such as installing dependent packages first and then the primary patches.
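The hierarchical, decision-tree-style grouping described above can be sketched as a small decision function (the group-naming scheme and host list are assumptions for illustration):

```python
def assign_group(os_type, distro, version):
    """Hierarchical grouping: decide OS family first, then the
    distribution, then the major version number."""
    if os_type == "Windows":
        return f"win-{version.replace(' ', '').lower()}"
    if os_type == "Linux":
        major = version.split(".")[0]          # e.g. "7.6" -> "7"
        return f"{distro.lower()}-{major}"
    return "unclassified"

hosts = [
    ("db01", "Linux", "CentOS", "7.6"),
    ("app02", "Linux", "Ubuntu", "18.04"),
    ("ad01", "Windows", "Server", "Server 2016"),
]
groups = {h: assign_group(t, d, v) for h, t, d, v in hosts}
```

Each branch mirrors one level of the decision tree, which keeps the rules easy to audit and extend.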
Careful policy formulation can ensure the safety and stability of the system to the maximum extent. Converting patch policies into executable scripts is the key to automation. Taking Ansible as an example, a playbook in YAML format can be written to define the sequence of tasks to be performed. The command module is used to execute specific shell commands, and modules such as yum can install a specified software package. Ansible's advantage lies in its declarative syntax and idempotence: the same playbook can be run multiple times without disturbing system state. Monitoring patch installation progress in real time is critical for large-scale deployment. A callback can be added to Ansible tasks so that the execution result of each step is written into the log system in real time. A common log system such as the ELK stack (Elasticsearch, Logstash, Kibana) provides strong log aggregation and visualization capabilities, allowing operators to track the patch deployment status in real time. Verification after patch installation is a necessary step to ensure that the policy has been executed in place. The system information collection script can be executed again to compare the list of packages and version numbers before and after installation. If a discrepancy is found, a discrepancy report can be generated, and based on its content a decision can be made as to whether additional patch installation or rollback operations are required. This closed-loop management can effectively improve the accuracy and completeness of patch management. Automating the whole flow not only improves efficiency but also reduces the risk of human error. For example, manual grouping may result in some specially configured hosts being misclassified, while using algorithms ensures consistency of classification.
Automated patch deployment can also ensure that all hosts are updated according to the same flow and standard, avoiding inconsistent system configurations caused by human negligence. However, automation also presents new challenges, for example, how to handle failures that may occur during patch installation, such as network outages or insufficient disk space. This requires adding more error handling and retry mechanisms to the scripts. In addition, for critical business systems, it may be necessary to add a manual confirmation step to the automated process to balance efficiency and safety. In general, the automatic grouping and patch management method based on system information can greatly improve the management efficiency and consistency of a large-scale cloud environment. By converting human experience into algorithms and automated scripts, not only can repetitive work be reduced, but the precise execution of management policies can also be ensured. Meanwhile, a complete log recording and verification mechanism provides a basis for subsequent audit and optimization.
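The post-installation verification and differential-update check can be sketched as a set difference over package lists (a minimal illustration; the package names are hypothetical):

```python
def patch_diff_report(expected, installed):
    """Compare the package set a patch policy requires against what a
    rescan actually found, reporting gaps for differential update."""
    expected, installed = set(expected), set(installed)
    return {
        "missing": sorted(expected - installed),     # triggers differential update
        "unexpected": sorted(installed - expected),  # outside the policy
        "compliant": expected <= installed,
    }

report = patch_diff_report(
    expected=["kernel-3.10.0-1160.el7", "openssl-1.0.2k-25.el7"],
    installed=["kernel-3.10.0-1160.el7"],
)
```

A non-empty "missing" list is exactly the discrepancy report the text describes, from which an additional install or rollback decision follows.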
The method acquires cloud host operating system versions and patch installation conditions through scanning, divides the hosts into management groups according to system type and version number, formulates a unified patch strategy and update plan for each group, and determines the patch update scheme of each cloud host.
The cloud hosts are comprehensively scanned with a scanning tool to acquire the operating system type, version number, and installed patch information of each cloud host. According to the acquired operating system type and version number, the cloud hosts are automatically grouped by a clustering algorithm, so that the cloud hosts in each group have similar operating system characteristics. For each group, the patch installation condition of the cloud hosts in the group is analyzed, existing patch gaps and vulnerability risks are identified, and the priority of patch updates is determined according to the risk level. Combining the cloud host grouping information and patch priorities, a unified patch strategy and update plan is formulated for each group, defining patch update time nodes and specific operation steps. When formulating the patch strategy, the dependency relationships and compatibility among different patches are analyzed through an association rule mining algorithm to ensure the reasonableness and safety of patch updates. According to the unified patch strategy and update plan, a personalized patch update scheme is generated for each cloud host, explicitly listing the patches to be installed and the specific operation flow. An automated operation and maintenance tool is adopted to update and repair patches on the cloud hosts in batches according to the patch update scheme, and the stability and controllability of the patch update process are ensured through monitoring and rollback mechanisms.
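The dependency-aware installation ordering implied by the association-rule analysis above can be sketched with a topological sort (the patch names and dependency edges here are hypothetical):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency edges: each patch maps to the set of
# patches that must be installed before it.
deps = {
    "app-hotfix-42": {"dotnet-framework-4.8"},
    "security-rollup-07": {"servicing-stack-update"},
    "dotnet-framework-4.8": set(),
    "servicing-stack-update": set(),
}

# static_order() yields a valid installation sequence, or raises
# CycleError if the mined rules contradict each other.
order = list(TopologicalSorter(deps).static_order())
```

Running the sort before deployment catches conflicting rules early and yields a safe batch installation sequence.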
Illustratively, cloud host scanning is the basis for patch management. Cloud host information may be obtained comprehensively through specialized scanning tools such as Nessus or OpenVAS. For example, a scan result may show that a host runs CentOS 7.6 and has the kernel-3.10.0-957.el7 patch installed but lacks the latest security updates. This detailed information provides the basis for subsequent grouping and policy formulation. A clustering algorithm may automatically group similar hosts. For example, using the K-means algorithm with operating system type and version as features, hosts can be categorized into a Windows Server 2016 group, an Ubuntu 18.04 group, and so on. This grouping approach makes subsequent patch management more targeted and efficient. Patch analysis is a key step in identifying risk. By comparing installed patches with newly released ones, potential vulnerabilities can be discovered. If a Windows server group lacks the MS17-010 patch, the high-risk EternalBlue vulnerability is present and should be given the highest update priority. This risk-based prioritization ensures that the most critical security issues are resolved in time. Formulating a patch policy requires comprehensive consideration. In addition to security, business continuity is also a concern. For example, critical business systems may need to be updated during off-peak hours with rollback time reserved. In particular, database servers may schedule patch updates between 2:00 and 4:00 a.m. on weekends, leaving a 2-hour observation period. Such careful planning minimizes the impact on business. Analyzing the dependencies between patches is critical. Association rule mining can reveal, for example, that .NET Framework updates must precede certain application patch installations. Such analysis can avoid system instability caused by patch conflicts and improve the update success rate. The personalized patch scheme considers the characteristics of each host.
For example, for Web servers, in addition to conventional system patches, special attention is required for security updates of Apache or Nginx. For database servers, patches for Oracle or MySQL are important considerations. This customized scheme ensures that each host gets the most appropriate updates. Automated tools such as Ansible greatly improve patch update efficiency. Batch, orderly patch installation can be realized by writing a playbook. For example, non-critical servers can be updated first, and core business servers updated after a period of observation. Meanwhile, by setting checkpoints and rollback mechanisms, if a certain patch is found to destabilize the system, it can be quickly restored to the pre-update state, ensuring the controllability of the whole process. This series of steps forms a complete patch management closed loop. From scanning, analysis, and strategy formulation to execution and monitoring, each link is carefully designed and mutually supporting. This systematic method not only improves the security of the cloud hosts but also optimizes the efficiency and reliability of the whole patch management flow. By continuously improving the process, the overall security level and operation and maintenance quality of the cloud environment can be continuously improved.
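The K-means host grouping mentioned above can be sketched in a few lines (a toy, stdlib-only implementation over one-hot encoded OS features, with deterministic seeding so the example is reproducible; a real system would typically use a library such as scikit-learn):

```python
def one_hot(value, vocabulary):
    """Encode a categorical value (e.g. an OS name) as a one-hot vector."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def kmeans(points, k, iterations=5):
    """Tiny k-means; centers seeded from the first k distinct points."""
    centers = []
    for p in points:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break

    def nearest(p):
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # Recompute each center as the mean of its cluster (keep old center if empty).
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return [nearest(p) for p in points]

oses = ["CentOS 7", "Ubuntu 18.04", "Windows Server 2016"]
hosts = [("db01", "CentOS 7"), ("db02", "CentOS 7"),
         ("web01", "Ubuntu 18.04"), ("ad01", "Windows Server 2016")]
labels = kmeans([one_hot(os_name, oses) for _, os_name in hosts], k=3)
```

With purely categorical features, hosts sharing an OS land in the same cluster, which is the management-group behavior the text describes.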
S102, setting up a mirror server in a government network, downloading security update files required by versions of each operating system from a trusted patch source, analyzing key attributes such as applicable systems, version numbers and the like of patches by extracting metadata information of update packages, and automatically classifying the patch files into corresponding catalogues by adopting a clustering algorithm of machine learning.
And obtaining available server resources in the government network, and determining the hardware configuration and network environment for building the mirror image server. And acquiring the security update files of each operating system version from the trusted patch source, and downloading the security update files to the appointed storage position of the mirror image server. And extracting metadata information of the downloaded patch file to obtain key attributes such as a suitable system and version number of the patch. And according to the extracted key attributes of the patch, adopting a K-means clustering algorithm to automatically classify the patch files according to the applicable system and version numbers. And judging the operating system and version of each patch file according to the classification result obtained by the clustering algorithm, and moving the operating system and version to the corresponding directory of the mirror image server. And configuring Web services such as Nginx and the like on the mirror server, providing a downloading service of the patch file, and ensuring that each system in the government network can access and download the required security update. Continuously monitoring the updating of the trusted patch source, automatically downloading the newly released patch file through a timing task, and repeatedly executing the classification and release processes to keep the patch library of the mirror server and the trusted source updated synchronously.
Illustratively, setting up a mirror server inside the government network is a key step for ensuring that systems are updated safely, timely, and reliably. First, the network environment and hardware resources need to be evaluated and a suitable server configuration selected. For example, a server with dual Intel Xeon processors, 128 GB of memory, 10 TB of storage space, and a gigabit network interface may be selected to meet the storage and high-concurrency download requirements of a large number of patch files. Acquiring security update files from a trusted patch source is the basis for ensuring patch reliability. Patch files may be synchronized from an official source, such as Windows Update or Red Hat Satellite, by means such as rsync. The downloaded patch files need to be preliminarily classified according to operating system type and version and stored under the designated directory of the mirror server. Extracting patch metadata information is key to achieving automated classification. A script can be used to parse the description information of a patch file and extract applicable attributes such as operating system, version number, and release date. For example, for Windows patches the XML description of a .msu file can be parsed, and for Linux patches the metadata of RPM or DEB packages can be parsed. Adopting a K-means clustering algorithm for automatic patch classification is an effective way to improve efficiency. The key attributes of the patch are first converted into a numerical vector; for example, the operating system type may be mapped to different values. The number of cluster centers is then set, such as the total number of Windows and Linux versions. Similar patches are aggregated into the same category by iterative computation.
This method can effectively process a large number of patch files and automatically identify new operating system versions. According to the clustering result, the patch files can be moved into the corresponding directory structure. For example, directories such as Windows/Server2016 and Linux/CentOS7 can be created, and the clustered patch files stored by category, facilitating subsequent management and downloading. Configuring the Web service is key to providing the patch download service for systems inside the government network. Nginx can be used as the Web server, with virtual hosts and access control configured so that only IP addresses inside the government network are allowed access. Meanwhile, caching and compression can be enabled to improve download efficiency. To ensure security, HTTPS may be configured to encrypt transmission using a self-signed certificate. Continuously monitoring patch sources and updating automatically is an important means of keeping the mirror server current. A timed task may be written to synchronize new patch files from the official source every morning. After synchronization completes, the patch classification and release process is automatically triggered, ensuring that new patches are provided to systems inside the government network in time. This automatic mechanism can greatly reduce manual intervention and improve the efficiency and accuracy of patch management. Through the above steps, an efficient and reliable patch mirror service can be built inside the government network, providing timely security updates for each system and effectively improving overall network security. The local mirror service not only accelerates patch distribution but also effectively controls the patch source and reduces dependence on external networks, and is an important safeguard for the secure operation and maintenance of the government network.
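The directory-routing step above can be sketched as a mapping from parsed metadata to a mirror path (the mirror root and metadata field names are assumptions for illustration):

```python
import pathlib

def route_patch(metadata, root="/srv/mirror"):
    """Map extracted patch metadata to a mirror directory, e.g.
    {'os': 'Windows', 'product': 'Server 2016'} -> .../Windows/Server2016."""
    product_dir = metadata["product"].replace(" ", "")  # "CentOS 7" -> "CentOS7"
    return str(pathlib.PurePosixPath(root) / metadata["os"] / product_dir)

linux_dest = route_patch({"os": "Linux", "product": "CentOS 7"})
windows_dest = route_patch({"os": "Windows", "product": "Server 2016"})
```

The scheduled sync job would call such a routing function for each newly classified patch before the Web service publishes it.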
S103, extracting key features from the patch files in the mirror server and training an intelligent classification model, so that when a new patch file is added, the application scope of the patch is accurately identified and the patch file is automatically placed into the corresponding catalogue, ensuring that each operating system version can find its matching update files.
Metadata of all patch files in the mirror server is acquired, and the file metadata is parsed to obtain file names, sizes, hash values, and associated-information lists, yielding a preliminary classification set of patch files for different operating systems. For each patch file in the preliminary classification set, the operating system field and version field are determined from its associated-information list, the operating system version information applicable to the patch file is extracted, and the patch file set is labeled according to the version information to obtain sample data. The sample data is parsed to generate a binary data stream, and numerical statistics such as byte frequencies are computed over the stream; when the occurrence frequencies of specific bytes conform to a normal distribution, a file-difference feature vector is constructed from the statistical feature values. A support vector machine algorithm is trained on the extracted file-difference feature vectors to obtain a supervised learning classifier; the computation iterates over the labeled data, the error between the predicted and labeled values keeps decreasing, and the accuracy of the training result finally converges. When a new patch file is obtained, its binary data stream is generated, numerical statistics are computed over the stream, and a file-difference feature vector for the new patch file is constructed according to the input feature requirements defined by the classifier. The file-difference feature vector of the new patch file is input into the supervised learning classifier to obtain the file-application-scope classification; from this classification result, the specific operating system version directory in which the patch is to be placed is determined.
According to the classification result, the newly added patch file is migrated to the specific classification directory on the server and the patch file index is updated; once the download list of patch files associated with the operating system is confirmed to be fully refreshed, all online users of the server are synchronized automatically.
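As a minimal sketch of the metadata-acquisition step (the function name and returned field set are assumptions), the file name, size, and SHA-256 hash described above can be collected as follows:

```python
import hashlib
import os

def patch_metadata(path):
    """Collect the basic metadata used for preliminary classification:
    file name, size in bytes, and a SHA-256 hash for integrity and uniqueness."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):  # stream large patch files
            sha256.update(chunk)
    return {
        "name": os.path.basename(path),
        "size": os.stat(path).st_size,
        "sha256": sha256.hexdigest(),
    }
```

The associated-information list (applicable OS, version number) would be read separately from the patch's description file and merged into this record.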
Illustratively, acquiring metadata of all patch files in the mirror server first requires traversing the server storage directory and reading the metadata of each patch file through the file system API. For example, basic information such as file name, size, and hash value can be obtained with the 'stat' command on a Linux system or the 'GetFileAttributesEx' function on a Windows system. The hash value can be computed with the SHA-256 algorithm, guaranteeing the integrity and uniqueness of the file. The associated-information list includes the operating system type, version number, and so on to which the patch applies; this information is typically stored in the patch file's metadata or in an attached description file. After parsing the file metadata, the patch files are preliminarily classified by the operating system field and version field. For example, patch files for Windows systems may contain fields such as "Windows10" and "Windows Server2016", and patch files for Linux systems may contain fields such as "Ubuntu20.04" and "CentOS7". This key information is extracted by regular expression matching or string parsing to form the preliminary classification set. On the basis of the preliminary classification set, the operating system version information applicable to each patch file is further extracted. For example, if a patch file's metadata contains "applicable to Windows10 version 1809", then "Windows10" and "1809" are extracted as key information. The patch file set is labeled with this information to form sample data, which may be represented in the form {(file name, size, hash value, operating system, version number)}. The sample data is parsed to generate a binary data stream, and the file's characteristics are analyzed by counting byte frequencies in the stream.
For example, counting the frequency of occurrence of each byte value in the file, if the frequency distribution of a specific byte accords with the normal distribution, the file difference feature vector can be constructed by using the statistical feature values. The feature vector may contain multiple dimensions, such as the number of times a high frequency byte occurs, the variance of the byte distribution, etc. And training a supervised learning classifier by adopting a Support Vector Machine (SVM) algorithm and combining the extracted file difference feature vectors. In the training process, the parameters of the classifier are continuously adjusted through repeated iterative computation, so that the error value between the predicted value and the marked value is continuously reduced. When the error value converges below a certain threshold, the training result is considered to reach a higher accuracy. For example, after 1000 iterations, the error value drops from 0.5 to 0.01, indicating that the classifier has learned the features of the sample data better. After the newly added patch file is obtained, a binary data stream is generated as well, and byte frequency statistics is carried out. And constructing a file difference feature vector of the new patch file according to the input feature value requirement defined by the classifier. For example, the byte frequency statistics of the new patch file show that the frequency of occurrence of a particular byte is highly similar to a certain class in the training sample, and the feature vector is input to the trained classifier. And determining an operating system and a version applicable to the new patch file according to the result output by the classifier. For example, if the output result of the classifier is "Windows10 version 1903", the patch file is migrated to the corresponding "Windows10/1903" directory in the server. 
Meanwhile, the patch file index is updated so that each system in the government network can obtain the latest patch information in time. Finally, it is checked whether the download list of patch files associated with the operating system has been fully refreshed; for example, by inspecting the cache update status of each system's patch management tool, it is ensured that all online users automatically synchronize the latest patch files. This method raises the automation level of patch management and safeguards the security and stability of systems inside the government network. Through the above steps, machine learning and statistical methods are used to classify patch files automatically and manage them efficiently, improving the security protection capability of the government network. The steps interlock, forming a rigorous logical chain that ensures the accuracy and timeliness of patch management.
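The byte-frequency statistics described above can be illustrated with a stdlib-only sketch; the feature layout (256 per-byte frequencies plus variance and top-byte share) is an assumption, and the resulting vector would then be fed to an SVM classifier such as scikit-learn's `SVC`:

```python
from collections import Counter

def byte_frequency_features(data: bytes):
    """Build a simple file-difference feature vector from a binary stream:
    per-byte frequencies plus summary statistics (variance, top-byte share)."""
    counts = Counter(data)
    total = len(data) or 1                      # guard against empty files
    freqs = [counts.get(b, 0) / total for b in range(256)]
    mean = sum(freqs) / 256
    variance = sum((f - mean) ** 2 for f in freqs) / 256
    top_byte_share = max(freqs)                 # share of the most frequent byte value
    return freqs + [variance, top_byte_share]
```

Vectors built this way for labeled patches form the training matrix; iterative SVM training then proceeds as the text describes.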
And S104, deploying a monitoring agent on the cloud host, collecting operation indexes and log information of each service process in real time, aggregating the collected data into a centralized log analysis platform, analyzing the logs and mining the data through a batch computation framework, and constructing a behavior model of service operation.
The deployed agent collects the running index values and log streams of each process in the cloud host's service process set and uploads the data. According to a preset log stream data format specification, the log analysis platform receives the data uploaded by the agent and classifies the log streams generated by different service processes, with each class assigned a different identifier. From the data content of each field in the log stream, the parser module in the analysis platform performs information extraction to obtain the log key information, the logs of the different service processes, and the identifiers of those processes. Anomaly detection is then performed per process using a pre-trained machine learning model: the log key information and identifiers are input into a classification model to obtain a normal or abnormal classification result for each process. The running index values of each process are acquired and assembled into a per-process index value set, which is input into a pre-trained regression model; the regression model's output yields a performance prediction result for each process. Based on the regression model output and the running index values, the computation module performs statistical analysis of the load state of each service process in the set, and a behavior sequence is obtained by comparing the process load under similar behaviors in the historical data. A behavior library is built from the behavior sequence and the current service process behavior; the behavior sequence is represented by a Markov model, which determines the next behavior of the service process.
The deployed agent collects the running index values and log streams of all service processes on the cloud host and uploads them to the log analysis platform. Assume a cloud host is running three service processes: a Web server, a database server, and a cache server. The agent monitors running indexes such as CPU utilization, memory usage, and network traffic of these processes in real time, and collects the log information they generate. The preset log stream data format specification is JSON, comprising fields such as timestamp, process ID, log level, and log content. For example, the Web server log may contain the URL and response time of each HTTP request, the database server log may contain the SQL query statement and execution time, and the cache server log may contain the cache hit rate and expiration times. After receiving the data uploaded by the agent, the log analysis platform first classifies the log streams generated by the different service processes. By parsing the process ID field in the log, the log streams are divided into three classes (Web server logs, database server logs, and cache server logs) and assigned different identifiers. Next, the parser module performs information extraction on each class of log: the URL and response time are extracted from the Web server logs, the SQL statement and execution time from the database server logs, and the cache hit rate and expiration time from the cache server logs. This key information is used for subsequent anomaly detection and performance prediction. A pre-trained machine learning model is adopted for anomaly detection.
And assuming that a decision tree-based classification model is used, inputting log key information and identification, and outputting normal or abnormal classification results of each process. For example, if the response time of a Web server suddenly increases, the model may mark it as abnormal, and if the SQL query execution time of the database server is abnormally long, the model may mark it as abnormal. And acquiring the running index value of each process, and constructing a running index value set of each process. For example, the set of running index values for the Web server may include CPU utilization, memory usage, network traffic, and the like. These index values are input into a pre-trained regression model to predict the performance of each process. Assuming a regression model based on random forests is used, the model predicts performance metrics over a period of time in the future based on historical data. And calculating the load state of each service process according to the regression model output result and the current operation index value. For example, by comparing the CPU utilization of the current Web server with the process load conditions of similar behavior in the history data, it can be derived whether the load state of the current Web server is high, medium or low. And comparing the process load conditions of similar behaviors in the historical data to obtain a behavior sequence. For example, the historical data shows that the query execution time of the database server increases significantly each time the CPU usage of the Web server exceeds 80%. This sequence of actions can be represented by a Markov model, predicting the next action of the service process. And constructing a behavior library according to the behavior sequence and the current service process behavior. 
For example, the CPU utilization of the current Web server has reached 85%, and it is predicted that the query execution time of the database server may increase according to the markov model in the behavior library, so that optimization measures, such as increasing buffering or optimizing SQL queries, are taken in advance. Through the steps, the log stream can be monitored and classified in real time, the performance and abnormal behavior of the process can be predicted, the intervention is performed in advance, and the stability and the high efficiency of the system are ensured. Such technical effects include improving system availability, reducing fault response time, optimizing resource allocation, and the like. For example, in practical application, when the response time of the Web server is detected to be abnormally increased, the system can automatically trigger the capacity expansion operation and increase the server instance, so that the load pressure is relieved, and when the SQL query execution time of the database server is abnormally long, the system can automatically optimize query sentences or increase indexes, and the query efficiency is improved. Through the multidimensional monitoring and predicting mechanism, the service process on the cloud host can run in an efficient and stable environment, and the overall performance and user experience of the system are greatly improved.
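The Markov-model behavior prediction above can be sketched as a first-order transition table; the class name and the behavior labels are illustrative:

```python
from collections import Counter, defaultdict

class BehaviorMarkov:
    """First-order Markov model over observed behavior sequences; predicts the
    most likely next behavior of a service process from its current behavior."""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def fit(self, sequences):
        # count transitions current -> next across all historical sequences
        for seq in sequences:
            for current, nxt in zip(seq, seq[1:]):
                self.transitions[current][nxt] += 1

    def predict_next(self, current):
        followers = self.transitions.get(current)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

    def probability(self, current, nxt):
        followers = self.transitions.get(current)
        if not followers:
            return 0.0
        return followers[nxt] / sum(followers.values())
```

A behavior library would store one such model per service process, updated as new sequences are observed, and a high-probability harmful transition would trigger the pre-emptive optimization measures described above.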
S105, training an anomaly detection model by using a machine learning algorithm aiming at the data in the log analysis platform in a supervised learning mode, wherein the model can automatically discover various anomaly behaviors in the service operation process from massive log data and generate alarm information.
A log dataset is constructed from the log content and should cover the various log entries in the historical data record, where normal and abnormal log entries carry data-tag marks. The labeled log dataset undergoes data preprocessing and feature extraction to obtain a numerical dataset, which is divided into a training set and a test set, with the sample proportions of the different service states in the training set balanced. A support vector machine model is trained on the training set to obtain an initial detection model; the test set is run through the initial detection model, and various indexes of the initial detection model are calculated against the test set detection results. Log content generated in the real environment is then acquired and preprocessed; the trained model of the previous time period is taken as the reference detection model, which processes the log data to generate preliminary prediction labels that are then adjudicated. In this adjudication, if a log record is preliminarily predicted abnormal, and its time and content match the log data of an abnormal record in the existing labeled database, the final prediction label is confirmed as abnormal. Abnormality judgment is performed on the final prediction label: if the final prediction label of a log is abnormal, the corresponding alarm level is determined, alarm schemes of different levels are applied, the final labeling result is recorded into the labeled database, and judgment continues; if a record's final label proves normal despite the preliminary abnormal prediction, this false alarm indicates a deficiency in the detection model.
The proportion of false alarms in the labeled database is checked; if it exceeds the set proportion, all abnormal-label samples in the labeled database are extracted for retraining to obtain a new time-period detection model, which is then redeployed.
Illustratively, constructing a log dataset from the log content first requires extracting all types of log entries, both normal and abnormal records, from the historical data. For example, in the order processing system of an e-commerce platform, normal logs may include entries such as "order creation success" and "payment completion", while abnormal logs may include "payment failure" and "stock shortage". Each log record needs a data tag such as "normal" or "abnormal". In the data preprocessing stage, the log data is cleaned and features are extracted. Given a log record "2023-10-01 10:00:00 user 123 payment failed", the features extracted after preprocessing may include the timestamp, user ID, operation type, and result status. These features are converted into numerical data: the timestamp becomes a Unix timestamp, and the operation type and result status are one-hot encoded. The preprocessed data is divided into a training set and a test set, ensuring balanced sample proportions of different service states in the training set; for example, a 1:1 ratio of normal to abnormal logs in the training set avoids biasing the model toward either class. Model training uses the Support Vector Machine (SVM) algorithm. Assuming the training set contains 10,000 log records, the SVM model establishes classification boundaries by learning the characteristics of these records. After training, the model is validated on the test set, and indexes such as accuracy, recall, and F1 score are calculated; for example, on a test set of 2,000 records the model may reach 90% accuracy, 85% recall, and an F1 score of 0.875. In the real environment, new log data is continuously collected and preprocessed. Supposing 1,000 new logs are collected on a given day, the previously trained reference detection model processes them to generate preliminary prediction labels.
If a log is preliminarily predicted abnormal and its time and content match an abnormal record in the labeled database, for example "2023-10-02 11:00:00 user 456 payment failed", the final prediction label is confirmed as abnormal. Abnormality judgment is performed on the final prediction label: if a log is finally predicted abnormal, the alarm level is judged according to its severity. For example, a payment failure may trigger a medium-level alarm, while a system crash triggers a high-level alarm. Alarm information notifies the relevant personnel via email, SMS, and other channels, and the final labeling result is recorded into the labeled database. If the proportion of false alarms in the labeled database exceeds the set threshold, say a false alarm rate of 20%, the model needs retraining: all abnormal-label samples are extracted from the database, the SVM model is retrained with the new data, and a new time-period detection model is generated and redeployed. When the behavior library is constructed, the behavior sequence is represented by a Markov model. Assuming the historical behavior sequence of a service process is "start-run-stop", the Markov model predicts the next behavior; if the current behavior is "running", the model predicts a higher probability that the next behavior is "stop". The advantage of this method is that through continuous learning and optimization the model identifies anomalies more accurately, reducing false alarms and missed alarms. Meanwhile, combining historical and real-time data gives a more complete picture of the service process's running state and improves the stability and reliability of the system. In practical application, constructing the log dataset and training the model is a dynamic process that requires continuous iteration and optimization.
By this method, anomalies can be discovered and handled in time, providing strong support for system optimization and upgrades. For example, analysis of the exception log may reveal that a certain interface reports errors frequently, allowing targeted optimization that improves system performance. In summary, through the steps of constructing a log dataset, preprocessing data, training and optimizing the model, judging anomalies, and alarming, the monitoring and management of service processes on the cloud host can be effectively improved, ensuring stable operation of the system.
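The preprocessing described above (timestamp to Unix time, one-hot encoding of operation type and result status) can be sketched as follows; the field names and category lists are assumptions for illustration:

```python
from datetime import datetime, timezone

OP_TYPES = ["order_create", "payment", "inventory"]   # assumed operation categories
STATUSES = ["success", "failure"]                      # assumed result statuses

def one_hot(value, categories):
    return [1.0 if value == c else 0.0 for c in categories]

def log_to_features(record):
    """Turn one parsed log record into a numeric vector: Unix timestamp,
    user id, one-hot operation type, one-hot result status."""
    ts = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S")
    ts = ts.replace(tzinfo=timezone.utc).timestamp()   # assume log times are UTC
    return [ts, float(record["user_id"])] \
        + one_hot(record["op"], OP_TYPES) \
        + one_hot(record["status"], STATUSES)
```

The resulting vectors, paired with their "normal"/"abnormal" tags, form the numerical dataset that the SVM is trained on.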
The method comprises the steps of obtaining an abnormal log by analyzing attributes such as log level and event time, judging the abnormal type according to the attributes such as request path and response state, training an abnormal detection model by adopting a random forest algorithm, and obtaining an abnormal detection result.
Log data generated by service operation is acquired, and each piece of log data is parsed for its log level and event time attributes; if the log level matches a predetermined abnormal level, the log is extracted as an abnormal log, yielding a number of abnormal log samples. For each abnormal log, the request path and response status attributes are parsed; the request path is matched against a pre-built path rule base to obtain matching information, and the response status is matched against a predetermined abnormal state table to obtain abnormal state information. If the matching information and the abnormal state each conform to some type, the abnormal type of the log is judged by combining the request path and response status attributes, yielding the abnormal type of each abnormal log. Abnormal-log features are then extracted, covering both textual numerical features and time-series features; the abnormal logs are divided by count into a training set and a test set, with one half of the training set's abnormal logs used for model construction and the other half for model training, yielding all training set data. The abnormal-log feature set of the training set is determined from all training set data generated during the data preprocessing operation, and a feature matrix is constructed for each abnormal log of the training set.
And aiming at feature matrixes constructed by the abnormal log feature sets of all training sets, calculating according to statistical values of the abnormal log features of each training set in different categories to obtain information gain rates of the abnormal log features of each training set, sequencing the abnormal log features of the training sets according to the information gain rates, sequentially inputting the sequenced feature matrixes into a constructed random forest model, training the random forest model, and carrying out cyclic iteration until convergence conditions are reached to obtain the random forest model for completing training. And obtaining each abnormal log of the test set during data preprocessing operation, determining each abnormal log feature set of the test set, and obtaining the statistical magnitude of each abnormal log of the test set in different categories to calculate and obtain the information gain rate of the abnormal log features of each test set. And sequencing the abnormal log characteristics of the test set according to the information gain rate. And obtaining the feature matrix of the ordered test set. And inputting the ordered abnormal log feature matrix of the test set according to the random forest model after training, performing forward operation to obtain a predicted output value, and comparing the label values of the test set to obtain the predicted accurate value of the abnormal log of each test set. And calculating an average value to obtain the model accuracy. And judging whether the next iteration is carried out according to the accuracy. And after the iteration is finished, obtaining an optimal detection model. And receiving an online log through an interface, analyzing log grade and time information contained in the log, and acquiring a log sample if the log grade is a preset abnormal grade. 
Analyzing the request path and the response state information of the log, carrying out path rule matching according to the request path to obtain matching information, carrying out state matching according to the response state to obtain state information, classifying the log into an abnormality if the matching information is matched with the state information by a preset abnormality type, and carrying out detection operation on the abnormality log according to an optimal detection model to obtain a detection result of whether the log is abnormal.
The log analysis platform performs anomaly detection through machine learning algorithms, which first requires acquiring the log data generated by service operation. Taking a Web server as an example, the log may include access time, request path, response status code, and similar information. When parsing log data, particular attention is paid to attributes such as log level and event time. For example, treating ERROR-level logs as abnormal samples allows potential problem logs to be screened out quickly. For further analysis of the abnormal logs, the request path and response status need to be parsed. Suppose a log shows the "/api/user" path returning a 500 status code; by matching against predefined path rules and the abnormal state table, it can be determined that this is a user-interface server error. This method locates problems quickly and helps handle faults in time. Feature extraction is a key step in model training. For text-based logs, numerical features can be extracted with methods such as TF-IDF; for time-series data, time-dependent features can be extracted with sliding-window techniques. For example, counting how often a particular error occurs within 30 minutes helps reveal periodic or bursty anomalies. In the model training phase, the random forest algorithm is widely used for its excellent performance and interpretability. Computing the information gain ratio to select the most discriminative features improves the efficiency and accuracy of the model: for example, if the feature "response time" has the highest information gain ratio, it is the most decisive for judging anomalies and should be prioritized. Model evaluation uses accuracy as an indicator, which helps measure the model's ability to identify anomalies.
Assuming 95% accuracy on the test set, this means that the model can accurately identify 95% of anomalies, but there is still 5% misjudgment and further optimization is required. Finally, the trained model is applied to real-time log analysis. When a new log is received, the system can quickly judge whether the log is abnormal. For example, if an API is detected to return an error frequently in a short time, the model may mark it as abnormal, and trigger an alarm mechanism, so that an operator can intervene in time. Compared with the traditional rule matching, the log analysis method based on machine learning has stronger adaptability and accuracy. The method can automatically learn complex abnormal modes, adapt to continuously-changing system environments and effectively improve the stability and reliability of services. Meanwhile, through continuous model updating and optimization, the abnormality detection capability of the system can be continuously improved, and long-term benefits are brought to IT operation and maintenance of enterprises.
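The information-gain-ratio ranking mentioned above can be illustrated for discrete features; this is a sketch of the standard gain-ratio formula, not the patent's exact computation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain_ratio(feature_values, labels):
    """Gain ratio of one discrete feature against class labels, used to rank
    abnormal-log features before feeding them to the random forest."""
    base = entropy(labels)
    total = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    conditional = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    split_info = entropy(feature_values)   # intrinsic information of the split
    if split_info == 0:
        return 0.0
    return (base - conditional) / split_info
```

Features are then sorted by this ratio, and the ordered feature matrix is fed to the random forest as the text describes.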
And S106, when the monitoring system finds that a certain service is abnormal, automatically notifying operation and maintenance personnel of the abnormality according to pre-configured alarm rules, triggering an automated emergency handling process, and taking different handling measures according to the severity and impact scope of the abnormality, such as restarting the service process or rolling back the patch version.
According to the log information of the collection server cluster, forming a log information stream through log structuring processing, identifying unstructured abnormal texts in the log information stream by using a natural language processing algorithm, comparing vectors corresponding to the abnormal texts with vectors in a pre-established feature library, calculating the matching degree, and triggering an alarm program if the matching degree exceeds a preset threshold value of 85%. And analyzing the matching degree sequence by using the time sequence prediction model, predicting the time point of the subsequent error log, if the time point sequence is dense, obtaining a service process fault conclusion, judging that the system reaches the severity level, acquiring historical state information as characteristic parameters, and transmitting the historical state information into a pre-constructed multi-mode fault prediction model to obtain an output result of the service to be restarted. Training a cyclic neural network model aiming at the fault information sequence, generating a fault investigation scheme set by predicting the next character of the information sequence, scoring the confidence degree of different steps in the fault investigation scheme set by comparing the actual service log content to obtain an optimal fault investigation strategy, and judging that the first step of the strategy needs to acquire a code management library modification record. Extracting a recent code modification record file and a historical version information file from a code management library, analyzing the difference between the files to obtain an associated file group, calculating the difference proportion before and after modification for each file in the associated file group, and marking the modification record file as a suspicious file if the difference proportion exceeds the limit. 
The suspicious file information is transmitted to a vulnerability scanning tool for file security assessment: an abstract syntax tree is created from the code syntax structure, each node of the abstract syntax tree is traversed and compared to obtain the number of vulnerabilities, and when the number of vulnerabilities exceeds a security threshold, the degree of association between the marked file and the vulnerabilities is determined. If the number of vulnerabilities exceeds a preset threshold of 8, a recently deployed patch information set is obtained and vectorized; a vector space model technique is used to represent the current version state, and the current and historical version states are passed through a version comparison tool to obtain a set of difference points, yielding the set of patch information files that need to be rolled back. A reverse-operation instruction set is constructed from the patch information files to be rolled back and the current version information; a cluster analysis algorithm performs clustering on the new log information vectors generated after the instructions are executed, and the risk level of the instruction set is judged by evaluating the compactness of the clustering result to determine whether to execute the reverse-operation instruction set.
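The abstract-syntax-tree traversal above can be illustrated with Python's `ast` module; the list of risky call names is a deliberately simplified stand-in for a real vulnerability pattern base:

```python
import ast

RISKY_CALLS = {"eval", "exec", "system"}   # hypothetical vulnerability patterns

def count_risky_nodes(source: str) -> int:
    """Walk the abstract syntax tree of a code file and count call nodes that
    match a (simplified) vulnerability pattern list."""
    tree = ast.parse(source)
    hits = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # plain calls carry .id (Name); attribute calls carry .attr
            name = getattr(func, "id", getattr(func, "attr", None))
            if name in RISKY_CALLS:
                hits += 1
    return hits
```

A file whose count exceeds the security threshold (8 in the text) would be flagged as strongly associated with the vulnerability.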
Illustratively, collecting server-cluster log information is the starting point for anomaly detection. Assume that the server cluster of an e-commerce platform generates several GB of log data every day, and these logs record detailed information such as user access, transaction processing, and database operations. Log structuring converts unstructured text into a structured data format such as JSON or CSV, which facilitates subsequent analysis. Unstructured abnormal text is identified using natural language processing (NLP) algorithms; for example, keywords in the log are extracted with the TF-IDF algorithm and the text is converted to a vector representation in conjunction with word-embedding techniques such as Word2Vec. Assuming a log record reads 'database connection failure', after NLP processing the vector corresponding to this text is compared with the 'database abnormal' vector in the feature library, cosine similarity is calculated, and if the similarity exceeds 85%, the alarm program is triggered. A time-series prediction model such as ARIMA or an LSTM network is used to analyze the matching-degree sequence. Assuming the matching degree of 'database connection failure' logs frequently exceeded the threshold over the past week, the model predicts that more similar errors may occur in the next 24 hours; if the predicted time points are dense, the service process is judged likely to fail and the system is deemed to have reached a severity level. Historical state information such as CPU utilization, memory occupancy, and network latency is acquired and fed as feature parameters into the pre-constructed multi-modal fault prediction model. The model may contain a hybrid structure of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), with the output suggesting a service restart to mitigate the failure.
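The matching-degree check described above can be sketched minimally in Python. This is an illustrative simplification: plain term-frequency vectors stand in for the TF-IDF/Word2Vec embeddings, and the texts are hypothetical; only the 0.85 threshold comes from the method.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over shared terms, normalized by the two vector magnitudes.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_degree(log_text: str, feature_text: str) -> float:
    # Bag-of-words term-frequency vectors stand in for the NLP embeddings.
    return cosine_similarity(Counter(log_text.split()),
                             Counter(feature_text.split()))

ALARM_THRESHOLD = 0.85  # the 85% matching threshold from the method

log_entry = "database connection failure timeout"
feature   = "database connection failure"          # entry from the feature library
score = match_degree(log_entry, feature)
triggered = score > ALARM_THRESHOLD                # alarm program would fire here
```

In this toy case the similarity is sqrt(3)/2 ≈ 0.866, just above the threshold, so the alarm would be triggered.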
A recurrent neural network model is trained on the fault information sequence, and a set of fault troubleshooting schemes is generated by predicting the next character of the sequence. For example, the model predicts 'check database connection configuration', the confidence of this step is scored against the actual service log content, and the best troubleshooting strategy is finally determined to be 'first acquire the code management library modification records'. Recent code modification record files and historical version information files are extracted from the code management library, and a file comparison tool (such as diff) is used to analyze the differences and obtain an associated file group. Assuming the difference ratio of a certain file before and after modification exceeds 30%, the file is marked as suspicious. The suspicious file information is passed to a vulnerability scanning tool (such as SonarQube), which builds an abstract syntax tree (AST), traverses its nodes, and obtains the vulnerability count through comparison operations. If the number of vulnerabilities exceeds 8, the marked file is determined to be highly associated with the vulnerabilities. A set of recently deployed patch information is obtained and vectorized, with vector space model (VSM) techniques representing the current version state. A version comparison tool (such as git diff) yields the set of difference points, from which the set of patch information files requiring rollback is obtained. A reverse operation instruction set is then constructed from the patch information files to be rolled back and the current version information.
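The 30% difference-ratio check used to mark suspicious files can be sketched with Python's standard difflib; the file contents below are hypothetical stand-ins for the code management library's records.

```python
import difflib

def diff_ratio(old: str, new: str) -> float:
    # Fraction of line content changed between two versions of a file
    # (1.0 means nothing in common, 0.0 means identical).
    sm = difflib.SequenceMatcher(None, old.splitlines(), new.splitlines())
    return 1.0 - sm.ratio()

SUSPICIOUS_THRESHOLD = 0.30  # the 30% difference ratio from the example

old_src = "def connect(db):\n    return db.open()\n"
new_src = (
    "def connect(db, retries=3):\n"
    "    for _ in range(retries):\n"
    "        if db.open():\n"
    "            return True\n"
    "    return False\n"
)
is_suspicious = diff_ratio(old_src, new_src) > SUSPICIOUS_THRESHOLD
```

Files whose ratio crosses the threshold would then be handed to the vulnerability scanner for AST-based assessment.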
A cluster analysis algorithm (such as K-means) is applied to the new log information vectors generated after the instructions are executed; the intra-cluster compactness of the clustering result is evaluated to judge the risk of the instruction set. If the intra-cluster compactness is high, indicating a low risk for the reverse operation, the instruction set may be executed. Through these steps, abnormal behaviors are automatically discovered in log data and alarm information is generated, while multi-level model analysis and operation instruction generation ensure system stability and security. This comprehensive anomaly detection and troubleshooting mechanism can effectively improve operation and maintenance efficiency and response speed, and reduce service interruption time caused by faults.
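The intra-cluster compactness evaluation can be illustrated as follows. This is a sketch under assumptions: the cluster assignments are taken as given (as if produced by K-means), the 2-D log vectors are invented for the example, and the risk threshold is hypothetical.

```python
import math

def centroid(points):
    # Component-wise mean of a list of equal-length numeric tuples.
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def compactness(clusters):
    # Mean Euclidean distance of members to their cluster centroid;
    # smaller values mean tighter clusters and, per the method, lower risk.
    total, count = 0.0, 0
    for pts in clusters:
        c = centroid(pts)
        for p in pts:
            total += math.dist(p, c)
            count += 1
    return total / count

LOW_RISK_LIMIT = 0.5  # hypothetical threshold for allowing the rollback

clusters = [
    [(0.10, 0.20), (0.15, 0.22), (0.12, 0.18)],  # log vectors after rollback
    [(0.90, 0.80), (0.88, 0.85)],
]
risk_ok = compactness(clusters) < LOW_RISK_LIMIT  # tight clusters => execute
```

With real K-means output the same idea corresponds to a low inertia (within-cluster sum of distances) relative to a calibrated baseline.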
S107, summarizing and organizing the configuration parameters, log formats, common faults, and other information of different operating systems and services in an operation and maintenance knowledge base to form a set of standardized operation and maintenance specifications and processes, and associating and reasoning over this knowledge through knowledge graph technology to form an intelligent operation and maintenance decision support system that provides diagnosis and handling suggestions for operation and maintenance personnel.
According to the operating system and service configuration information in the operation and maintenance knowledge base, association and reasoning are carried out through knowledge graph technology to obtain standardized configuration parameter templates, forming a configuration specification library. Natural language processing techniques are used to analyze and extract the log formats in the knowledge base, acquire key fields and event types, and build a log parsing model that automates log analysis. Common fault information in the knowledge base undergoes association analysis through knowledge graph technology to obtain the causes, impact scope, and solutions of faults, forming a fault diagnosis knowledge base. Based on the configuration specification library, the log parsing model, and the fault diagnosis knowledge base, a rule-based reasoning engine realizes automated decision-making for the operation and maintenance process and generates standardized operation steps. Historical operation and maintenance data are trained with machine learning algorithms to obtain classification and prediction models of operation and maintenance events, enabling early warning and advance handling of potential faults. The operation and maintenance decision support system is integrated with the monitoring platform to acquire the running-state data of systems and services in real time, judge whether anomalies exist through knowledge graph reasoning and machine learning models, and give diagnosis suggestions.
According to the diagnosis suggestions and standardized operation steps, automated handling scripts are generated and executed through the operation and maintenance automation platform, realizing quick fault repair and system recovery and improving operation and maintenance efficiency and system availability.
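The rule-based reasoning engine that maps monitored events to standardized operation steps can be sketched as a first-match rule table. Everything here (field names, metrics, thresholds, step texts) is hypothetical and illustrative only; the CPU-above-90% rule mirrors the example given later in the text.

```python
# Hypothetical rule set for the reasoning engine; each rule pairs a
# predicate on a monitoring event with standardized operation steps.
RULES = [
    {"when": lambda e: e["metric"] == "cpu" and e["value"] > 90,
     "steps": ["check process status", "analyze load sources"]},
    {"when": lambda e: e["metric"] == "disk" and e["value"] > 85,
     "steps": ["list large files", "rotate old logs"]},
]

def decide(event):
    # First matching rule yields the standardized operation steps;
    # unmatched events fall through to a human operator.
    for rule in RULES:
        if rule["when"](event):
            return rule["steps"]
    return ["escalate to operator"]

steps = decide({"metric": "cpu", "value": 95})
```

A production engine would load such rules from the configuration specification library rather than hard-coding them.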
Illustratively, the operation and maintenance knowledge base is the core of IT operation and maintenance management, and contains a great deal of configuration information, log format and fault handling experience. Through knowledge graph technology, the information can be structured and subjected to association analysis to form a standardized configuration specification library. For example, for the security configuration of the Linux server, key parameters such as the maximum login failure times, the password complexity requirements and the like can be extracted, and the association relationship between the key parameters is established. Thus, whether the configuration meets the specification can be rapidly checked, and potential safety risks can be deduced. Log parsing is an important link in operation and maintenance automation. Key information can be extracted from unstructured log text through natural language processing techniques. For example, for a log of "user login failure", fields such as a user name, an IP address, a failure cause, etc. may be identified. These structured log data provide the basis for subsequent analysis and decision making. The construction of the fault diagnosis knowledge base utilizes the reasoning capability of the knowledge graph. By analyzing historical fault cases, associations between fault symptoms, causes, and solutions can be established. For example, when a Web service is found to be slow in response, the system can infer possible reasons including exhaustion of a database connection pool, network congestion and the like according to the knowledge graph, and provide corresponding investigation steps. Rule-based reasoning engines are key to implementing automated decisions. The method combines configuration specifications, log analysis results and fault diagnosis knowledge to generate a standardized operation and maintenance operation flow. 
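The log-parsing step for the "user login failure" example can be sketched with a regular expression over named groups. The log line format here is a hypothetical one invented for illustration; real log formats would each need their own pattern in the log parsing model.

```python
import re

# Hypothetical key=value log format for the "user login failure" example.
LOG_PATTERN = re.compile(
    r"user=(?P<user>\S+)\s+ip=(?P<ip>\S+)\s+"
    r"result=login_failure\s+reason=(?P<reason>\S+)"
)

def parse_login_failure(line: str):
    # Extract the user name, IP address, and failure cause fields,
    # or None if the line is not a login-failure record.
    m = LOG_PATTERN.search(line)
    return m.groupdict() if m else None

record = parse_login_failure(
    "2024-12-01 user=alice ip=10.0.0.5 result=login_failure reason=bad_password"
)
```

The structured fields extracted this way are what feed the downstream analysis and decision steps.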
For example, when server CPU usage is detected to remain above 90%, the system can automatically generate a series of investigation steps, including checking process status and analyzing load sources. Machine learning algorithms play an important role in operation and maintenance early warning. By training on historical data, a prediction model for system anomalies can be established; for example, by analyzing past disk usage trends, the system can predict when disk space may be exhausted and issue an early warning ahead of time. Integrating the operation and maintenance decision support system with the monitoring platform enables real-time anomaly detection and diagnosis. For example, when the query latency of a database service is observed to increase suddenly, the system can immediately start a diagnosis flow, analyze whether problems such as slow queries or index failures exist, and give optimization suggestions. Finally, the generation and execution of automated handling scripts is critical to improving operation and maintenance efficiency. For example, when a service anomaly is diagnosed as caused by a memory leak, the system can automatically generate a script to restart the service and execute it through the operation and maintenance platform to achieve quick recovery. This not only reduces human intervention but also greatly shortens fault recovery time. Through this series of technologies and processes, an intelligent operation and maintenance system can be constructed that responds to and resolves problems quickly, predicts and prevents potential faults, and significantly improves system availability and stability.
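The disk-exhaustion early warning mentioned above can be sketched with a simple linear trend fit over daily usage percentages. This is a deliberate simplification of the time-series prediction models the text names (ARIMA/LSTM); the usage history is invented for the example.

```python
def days_until_full(usage_history, capacity=100.0):
    # Least-squares linear trend over daily usage percentages; returns the
    # estimated number of days until capacity is reached, or None if usage
    # is not growing. A toy stand-in for the ARIMA/LSTM models in the text.
    n = len(usage_history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage_history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage_history))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # usage flat or shrinking: no exhaustion predicted
    return (capacity - usage_history[-1]) / slope

eta = days_until_full([70.0, 72.0, 74.0, 76.0])  # growing 2%/day, 24% headroom
```

An alert rule would then fire when the estimate drops below a chosen lead time (e.g. 7 days).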
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process transformation made using the content of this specification, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A method for updating an operating system based on a cloud service, characterized in that the method comprises:
scanning the system information of cloud hosts to obtain the operating system version and patch installation status of different hosts, dividing the hosts into corresponding management groups according to operating system type and version number, and formulating a unified patch strategy and update plan for each group;
building a mirror server inside the government network to download the security update files required for each operating system version from a trusted patch source, extracting metadata information from the update packages, analyzing key attributes such as the applicable system and version number of each patch, and using a machine learning clustering algorithm to automatically classify the patch files into corresponding directories;
for the patch files in the mirror server, extracting key features and training an intelligent classification model, so that when a new patch file is added, the model can accurately identify the applicable scope of the patch and automatically place it into the corresponding directory, ensuring that different operating system versions can find matching update files;
deploying monitoring agents on the cloud hosts to collect the operating indicators and log information of each service process in real time, aggregating the collected data to a centralized log analysis platform, and parsing and mining the logs through a batch computing framework to build a behavioral model of service operation;
for the data in the log analysis platform, using machine learning algorithms to train, through supervised learning, an anomaly detection model that automatically discovers various abnormal behaviors during service operation from massive log data and generates alarm information;
when the monitoring system finds that a service is abnormal, automatically notifying the operation and maintenance personnel of the abnormality according to pre-configured alarm rules, and triggering an automated emergency response process in which different handling measures are taken according to the severity and impact scope of the abnormality, such as restarting the service process or rolling back the patch version;
in an operation and maintenance knowledge base, summarizing and organizing the configuration parameters, log formats, common faults, and other information of different operating systems and services to form a set of standardized operation and maintenance specifications and processes, and associating and reasoning over this knowledge through knowledge graph technology to form an intelligent operation and maintenance decision support system that provides diagnosis and handling suggestions for operation and maintenance personnel.
2. The method according to claim 1, characterized in that scanning the system information of the cloud hosts to obtain the operating system version and patch installation status of different hosts, dividing the hosts into corresponding management groups according to operating system type and version number, and formulating a unified patch strategy and update plan for each group comprises:
remotely connecting to the cloud host through an API interface or the SSH protocol, executing system commands, and obtaining system information such as the cloud host's operating system type, version number, and patch installation list;
based on the obtained cloud host system information, using a decision tree algorithm to determine the management group to which each cloud host belongs according to preset operating system type and version number rules;
in the configuration management database, updating the management group field of each cloud host according to its unique identifier, thereby dividing the cloud host into the corresponding group;
for each management group, the operation and maintenance personnel formulating a unified system patch strategy based on the operating system characteristics of the group, and determining the list of patches to be installed and the installation priority;
converting the patch strategy into an executable script program, and using automated operation and maintenance tools such as Ansible to issue patch installation tasks in batches to all cloud hosts in the group;
during the patch installation process, obtaining the installation progress and result of each cloud host in real time and recording them in the system log for tracking and auditing;
after the patch installation is completed, scanning the system information of the cloud hosts again to obtain the latest patch installation status and determine whether it is consistent with the patch strategy, and if not, performing differential updates to ensure that the cloud hosts in each group meet the requirements of the patch strategy;
further comprising: obtaining the cloud host operating system version and patch installation status through scanning, dividing management groups according to system type and version number, formulating a unified patch strategy and update plan for the groups, and determining the patch update scheme for each cloud host.
3. The method according to claim 2, characterized in that obtaining the cloud host operating system version and patch installation status through scanning, dividing management groups according to system type and version number, formulating a unified patch strategy and update plan for the groups, and determining the patch update scheme for each cloud host comprises:
performing a comprehensive scan of the cloud hosts with a scanning tool to obtain the operating system type, version number, and installed patch information of each cloud host;
based on the obtained operating system type and version number, using a clustering algorithm to automatically group the cloud hosts so that the cloud hosts in each group have similar operating system characteristics;
for each group, analyzing the patch installation status of the cloud hosts in the group, identifying missing patches and vulnerability risks, and determining the priority of patch updates according to the risk level;
combining the cloud host grouping information and patch priorities to formulate a unified patch strategy and update plan for each group, specifying the time nodes and concrete operation steps for patch updates;
when formulating the patch strategy, using an association rule mining algorithm to analyze the dependencies and compatibility between different patches to ensure the rationality and security of patch updates;
according to the unified patch strategy and update plan, generating a personalized patch update scheme for each cloud host, specifying the list of patches to be installed and the concrete operation process;
using automated operation and maintenance tools to update and repair the cloud hosts in batches according to the patch update scheme, and ensuring a smooth and controllable patch update process through monitoring and rollback mechanisms.
4. The method according to claim 1, characterized in that building a mirror server inside the government network, downloading the security update files required for each operating system version from a trusted patch source, extracting metadata information from the update packages, analyzing key attributes such as the applicable system and version number of each patch, and using a machine learning clustering algorithm to automatically classify the patch files into corresponding directories comprises:
obtaining the server resources available within the government network and determining the hardware configuration and network environment for building the mirror server;
obtaining the security update files for each operating system version from the trusted patch source and downloading them to the designated storage location of the mirror server;
extracting metadata information from the downloaded patch files to obtain key attributes such as the applicable system and version number of each patch;
based on the extracted key attributes, using the K-means clustering algorithm to automatically classify the patch files according to applicable system and version number;
according to the classification results of the clustering algorithm, determining the operating system and version to which each patch file belongs and moving it to the corresponding directory on the mirror server;
configuring Nginx or another Web service on the mirror server to provide a patch file download service, ensuring that all systems within the government network can access and download the required security updates;
continuously monitoring the trusted patch source for updates, automatically downloading newly released patch files through scheduled tasks, and repeating the above classification and publishing process to keep the mirror server's patch library synchronized with the trusted source.
5. The method according to claim 1, characterized in that extracting key features for the patch files in the mirror server and training an intelligent classification model, so that when a new patch file is added, the model can accurately identify the applicable scope of the patch and automatically place it into the corresponding directory, ensuring that different operating system versions can find matching update files, comprises:
obtaining the metadata of all patch files in the mirror server, parsing the file metadata to obtain the file name, size, hash value, and associated information list, and obtaining a preliminary classification set of patch files for different operating systems;
determining the operating system field and version field according to the associated information list of each patch file in the preliminary classification set, extracting the operating system version information applicable to the patch file, and annotating the patch file set according to the version information to obtain sample data;
parsing the sample data, generating binary data streams, and performing numerical statistics on the binary data streams.
6. The method according to claim 1, characterized in that deploying a monitoring agent on the cloud host to collect the operating indicators and log information of each service process in real time, aggregating the collected data to a centralized log analysis platform, and parsing and mining the logs through a batch computing framework to build a behavioral model of service operation comprises:
deploying the agent to collect the operating indicator values and log streams of each process in the cloud host's service process set, and uploading the data;
according to a preset log stream data format specification, the log analysis platform receiving the data uploaded by the agent and performing data classification on the log streams generated by different service processes, with different classifications receiving different identifiers;
through the data content of each field in the log stream, a parser module in the platform performing information extraction to obtain key log information and, for the logs of different service processes, determining the identifiers of the different service processes;
using a pre-trained machine learning model to perform anomaly detection on the different processes, inputting the key log information and identifiers into a binary classification model, and obtaining a normal or abnormal classification result for each process;
obtaining the operating indicator values of each process, building an operating indicator value set for each process, inputting the set into a pre-trained regression model, determining the regression model output, and obtaining a performance prediction result for each process;
according to the regression model output and the operating indicator values, a computing module performing statistical analysis on the load status of each service process in the service process set, and obtaining a behavior sequence by comparing the load of processes with similar behaviors in historical data;
building a behavior library from the behavior sequence and the current service process behavior, the behavior sequence being represented by a Markov model to determine the next behavior of the service process.
7. The method according to claim 1, characterized in that using machine learning algorithms on the data in the log analysis platform to train, through supervised learning, an anomaly detection model that automatically discovers various abnormal behaviors during service operation from massive log data and generates alarm information comprises:
constructing a log dataset from the log content, the dataset covering the various log entries of historical data records, in which normal and abnormal log entries carry data labels;
performing data preprocessing on the labeled log dataset, extracting features, obtaining a numerical dataset, and dividing it into a training set and a test set, with the proportions of samples of different service states in the training set kept balanced;
通过训练集对支持向量机算法的模型训练,得到初始检测模型,通过初始检测模型对测试集进行测试,计算出该初始检测模型针对测试集检测结果的各项指标;The support vector machine algorithm is trained through the training set to obtain an initial detection model, the test set is tested through the initial detection model, and various indicators of the initial detection model for the test set detection results are calculated; 获取真实环境中生成的日志内容并进行数据预处理,并获取之前时间序列的已训练模型作为基准检测模型,采用基准检测模型处理这些日志数据,生成初步预测标签,然后进行标签判定;Obtain the log content generated in the real environment and perform data preprocessing, and obtain the previously trained model of the time series as the benchmark detection model. Use the benchmark detection model to process these log data, generate preliminary prediction labels, and then perform label determination; 根据初步预测标签进行标签判定,针对初步预测标签,若某一条日志记录初步预测标签为异常,同时该条日志记录的时间和日志内容与已有标记数据库中某一条异常记录日志数据符合匹配条件,则确定最终预测标签为异常;The label is determined based on the preliminary predicted label. For the preliminary predicted label, if the preliminary predicted label of a log record is abnormal, and the time and log content of the log record meet the matching conditions with the log data of an abnormal record in the existing label database, the final predicted label is determined to be abnormal. 针对最终预测标签进行异常判定,若某一条日志最终预测标签为异常,则判断该条日志对应的告警级别,采用不同等级的告警方案进行告警,记录最终标记结果到已标记数据库中,然后继续执行判定,如果确定最终预测标签都为正常,则判定该检测模型存在缺陷;Perform anomaly determination on the final predicted labels. If the final predicted label of a log is abnormal, determine the alarm level corresponding to the log, use different levels of alarm schemes to issue an alarm, record the final labeling results in the labeled database, and then continue to perform the determination. If it is determined that the final predicted labels are all normal, it is determined that the detection model has defects. 判断已标记数据库中误报数量比例,若超过设定比例,则将已标记数据库中所有异常标签样本取出进行重新训练,获得新的时间序列的检测模型,然后重新部署模型;Determine the ratio of false positives in the labeled database. 
If it exceeds the set ratio, take out all abnormal label samples in the labeled database for retraining to obtain a new time series detection model, and then redeploy the model; 还包括:通过分析日志级别、事件时间等属性获取异常日志,根据请求路径、响应状态等属性判断异常类型,采用随机森林算法训练异常检测模型,得到异常检测结果。It also includes: obtaining abnormal logs by analyzing attributes such as log level and event time, judging the abnormal type according to attributes such as request path and response status, and using random forest algorithm to train the abnormal detection model to obtain the abnormal detection results. 8.根据权利要求7所述的方法,其特征在于,所述通过分析日志级别、事件时间等属性获取异常日志,根据请求路径、响应状态等属性判断异常类型,采用随机森林算法训练异常检测模型,得到异常检测结果,包括:8. The method according to claim 7 is characterized in that the abnormal log is obtained by analyzing the attributes such as log level and event time, the abnormal type is judged according to the attributes such as request path and response status, and the abnormal detection model is trained by random forest algorithm to obtain the abnormal detection result, including: 获取服务运行产生的日志数据,解析每条日志数据包含日志等级、事件时间属性信息,如果日志等级符合预先确定的异常等级,则提取该日志作为异常日志,得到多个异常日志样本;Obtain the log data generated by the service operation, parse each log data to include the log level and event time attribute information, and if the log level meets the predetermined abnormal level, extract the log as an abnormal log to obtain multiple abnormal log samples; 获取每个异常日志,解析每个异常日志的请求路径、响应状态属性信息,根据请求路径与预先构建的路径规则库进行匹配得到匹配信息,根据响应状态与预先确定的异常状态表进行匹配得到异常状态信息,如果匹配信息符合某个类型且异常状态符合某个类型,则综合两者判断异常类型,得到每个异常日志的异常类型;Obtain each exception log, parse the request path and response status attribute information of each exception log, match the request path with the pre-built path rule library to obtain matching information, match the response status with the pre-determined exception status table to obtain exception status information, and if the matching information matches a certain type and the exception status matches a certain type, then combine the two to determine the exception type 
and obtain the exception type of each exception log; 提取异常日志特征,特征提取包含异常日志文本数值特征与异常日志时序特征,把异常日志按数量分为训练集和测试集,训练集中一半异常日志用于构建模型,另一半异常日志用于模型训练,得到全部训练集数据;Extract abnormal log features. Feature extraction includes abnormal log text numerical features and abnormal log time series features. The abnormal logs are divided into training set and test set according to the number. Half of the abnormal logs in the training set are used to build the model, and the other half are used for model training to obtain all the training set data. 根据数据预处理操作时生成的全部训练集数据,确定训练集的异常日志特征集合;Determine the abnormal log feature set of the training set according to all training set data generated during the data preprocessing operation; 构建训练集的每个异常日志的特征矩阵;Construct the feature matrix of each abnormal log in the training set; 针对全部训练集的异常日志特征集合构建的特征矩阵,根据每个训练集异常日志特征在不同类别的统计量值计算得到各个训练集异常日志特征的信息增益率,按信息增益率对训练集异常日志特征排序,将排序后的特征矩阵依次输入至构建的随机森林模型,训练随机森林模型,循环迭代直至达到收敛条件后,得到完成训练的随机森林模型;A feature matrix is constructed for the abnormal log feature set of all training sets. The information gain rate of each abnormal log feature of the training set is calculated according to the statistical value of each abnormal log feature of the training set in different categories. The abnormal log features of the training set are sorted according to the information gain rate. The sorted feature matrix is input into the constructed random forest model in sequence. The random forest model is trained and iterated until the convergence condition is reached to obtain a trained random forest model. 
obtaining each abnormal log of the test set from the data preprocessing stage, determining the feature set of each test-set abnormal log, and computing the information gain ratio of each test-set abnormal log feature from its statistics across the different classes; sorting the test-set abnormal log features by information gain ratio; obtaining the sorted test-set feature matrix; feeding the sorted test-set feature matrix into the trained random forest model, performing a forward pass to obtain predicted outputs, and comparing them with the test-set labels to obtain the prediction precision for each test-set abnormal log; averaging these values to obtain the model accuracy; deciding from the accuracy whether to run the next iteration; obtaining the optimal detection model once iteration ends; receiving online logs through an interface and parsing the log level and time information they contain; if the log level equals the preset abnormal level, capturing the log sample; parsing the request path and response status of the log, matching path rules against the request path to obtain matching information, and matching the response status to obtain status information; if the matching information and status information match a preset abnormality type, the log is classified as abnormal.
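The information-gain-ratio ranking that claim 8 applies before training the random forest follows the standard entropy formulas. Below is a minimal stdlib sketch of that ranking step only; the feature names, sample data, and discrete-feature assumption are illustrative and not taken from the patent, and a production forest would typically come from a library rather than be hand-built:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_ratio(values, labels):
    """Information gain ratio of one discrete feature column vs. the labels."""
    n = len(labels)
    base = entropy(labels)
    groups = {}  # group labels by feature value
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = base - cond
    split = entropy(values)  # intrinsic information of the split
    return gain / split if split > 0 else 0.0

def rank_features(matrix, labels, names):
    """Sort feature names by descending information gain ratio,
    mirroring the ranking performed before the model is trained."""
    cols = list(zip(*matrix))
    scored = [(name, info_gain_ratio(col, labels)) for name, col in zip(names, cols)]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# toy abnormal-log feature matrix: (status_class, path_class) per log
X = [("5xx", "api"), ("5xx", "api"), ("4xx", "static"), ("4xx", "static"), ("5xx", "static")]
y = ["crash", "crash", "benign", "benign", "crash"]
ranking = rank_features(X, y, ["status_class", "path_class"])
print(ranking[0][0])  # status_class separates the two classes perfectly
```

The sorted feature order is what the claim feeds into the random forest; features with near-zero gain ratio could be dropped at this point.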
The detection operation is then performed on the abnormal log with the optimal detection model to obtain a detection result indicating whether the log is abnormal. 9. The method according to claim 1, characterized in that, when the monitoring system finds that a service is abnormal, automatically notifying operation and maintenance personnel of the abnormality according to pre-configured alarm rules while triggering an automated emergency handling process, and taking different handling measures, such as restarting the service process or rolling back the patch version, according to the severity and impact scope of the abnormality, comprises: structuring the collected server-cluster log information into a log information stream, identifying unstructured abnormal text in the stream with a natural language processing algorithm, comparing the vector of that abnormal text with the vectors in a pre-established feature library, and computing the degree of match; if the degree of match exceeds the preset threshold of 85%, triggering the alarm procedure; analyzing the sequence of match degrees with a time-series prediction model to predict the time points at which subsequent error logs will occur; if the sequence of time points is dense, concluding that the service process has failed and judging that the system has reached a severe level.
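The 85% match threshold in claim 9 presupposes some vector comparison; a common choice (not specified by the patent) is cosine similarity against the feature library. A minimal sketch, in which the feature library, its vectors, and the fault names are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical feature library: known abnormal-text vectors, e.g. from a
# bag-of-words or embedding step that the claim leaves unspecified
FEATURE_LIBRARY = {
    "oom_kill": [0.9, 0.1, 0.0, 0.4],
    "disk_full": [0.0, 0.8, 0.6, 0.1],
}
ALARM_THRESHOLD = 0.85  # the 85 % match threshold from claim 9

def should_alarm(log_vector):
    """Compare an abnormal-text vector against the feature library and
    return (alarm_triggered, best_matching_fault)."""
    best_name, best_score = max(
        ((name, cosine(log_vector, vec)) for name, vec in FEATURE_LIBRARY.items()),
        key=lambda t: t[1],
    )
    return best_score > ALARM_THRESHOLD, best_name

triggered, match = should_alarm([0.88, 0.12, 0.02, 0.41])
print(triggered, match)  # True oom_kill
```

The sequence of match degrees produced over time by `should_alarm` is what the subsequent time-series prediction step would consume.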
The historical status information is then obtained as a feature parameter and passed into the pre-built multi-modal fault prediction model, which outputs that the service needs to be restarted; training a recurrent neural network model on the fault information sequence and generating a set of troubleshooting plans by predicting the next character of the information sequence; scoring the confidence of the different steps in that set against the actual service log content to obtain the best troubleshooting strategy, whose first step is determined to be retrieving the modification records of the code management repository; extracting the recent code modification record files and historical version information files from the code management repository, analyzing the differences between the files to obtain groups of associated files, and computing the before-and-after difference ratio for each file in each group; if the difference ratio exceeds the limit, marking the modification record file as suspicious; transferring the suspicious file information to a vulnerability scanning tool for a file security assessment, creating an abstract syntax tree from the code's syntactic structure, and obtaining the number of vulnerabilities by traversing and comparing the nodes of the abstract syntax tree; when the number is judged to exceed the security threshold, determining the degree of association between the marked file and the vulnerabilities;
if the number of vulnerabilities exceeds the preset threshold of 8, obtaining the set of recently deployed patch information, vectorizing that set, representing the current version state with a vector space model, and obtaining the set of difference points between the current and historical version states with a version comparison tool, thereby obtaining the set of patch information files to be rolled back; constructing a set of reverse operation instructions from the patch information files to be rolled back and the current version information, applying a clustering analysis algorithm to the vectors of the new log information produced after the instructions execute, and judging the risk level of the instruction set by evaluating the intra-cluster tightness of the clustering result, to determine whether to execute the set of reverse operation instructions. 10. The method according to claim 1, characterized in that, in the operation and maintenance knowledge base, summarizing information such as the configuration parameters, log formats, and common faults of different operating systems and services to form a set of standardized operation and maintenance specifications and processes,
and, through knowledge graph technology, associating and reasoning over this knowledge to form an intelligent operation and maintenance decision support system that provides diagnosis and handling suggestions for operation and maintenance personnel, comprises: associating and reasoning over the operating system and service configuration information in the operation and maintenance knowledge base with knowledge graph technology to obtain standardized configuration parameter templates, forming a configuration specification library; parsing and extracting the log formats in the operation and maintenance knowledge base with natural language processing technology, obtaining key fields and event types, and building a log parsing model to automate log analysis; performing association analysis on the common fault information in the operation and maintenance knowledge base with knowledge graph technology to obtain the cause, impact scope, and solution of each fault, forming a fault diagnosis knowledge base; based on the configuration specification library, the log parsing model, and the fault diagnosis knowledge base, using a rule-based reasoning engine to automate decisions in the operation and maintenance process and generate standardized operation steps; training on historical operation and maintenance data with machine learning algorithms to obtain classification and prediction models for operation and maintenance events, enabling early warning and advance handling of potential faults;
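The association-and-reasoning step over the fault knowledge graph can be illustrated with a toy triple store; the entities, relations, and two-hop chain below are invented for the sketch and a real system would use a graph database or ontology engine:

```python
# (subject, relation, object) triples — a minimal stand-in for the
# knowledge graph in claim 10; all names are illustrative only
TRIPLES = [
    ("high_cpu", "caused_by", "runaway_process"),
    ("runaway_process", "resolved_by", "restart_service"),
    ("disk_alert", "caused_by", "log_overflow"),
    ("log_overflow", "resolved_by", "rotate_logs"),
]

def follow(entity, relation):
    """All objects reachable from `entity` via `relation`."""
    return [o for s, r, o in TRIPLES if s == entity and r == relation]

def diagnose(symptom):
    """Chain caused_by -> resolved_by to derive (cause, remediation) pairs,
    mimicking the association-and-reasoning step over the graph."""
    suggestions = []
    for cause in follow(symptom, "caused_by"):
        for fix in follow(cause, "resolved_by"):
            suggestions.append((cause, fix))
    return suggestions

print(diagnose("high_cpu"))  # [('runaway_process', 'restart_service')]
```

Longer reasoning chains (fault → cause → affected service → rollback target) follow the same pattern of repeated `follow` calls.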
integrating the operation and maintenance decision support system with the monitoring platform to obtain the real-time running status of systems and services, judging whether an abnormality exists through knowledge graph reasoning and machine learning models, and giving diagnostic suggestions; generating automated handling scripts from the diagnostic suggestions and the standardized operation steps, and executing them through the operation and maintenance automation platform to achieve rapid fault repair and system recovery, improving operation and maintenance efficiency and system availability.
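The rule-based reasoning engine that turns a live status snapshot into standardized operation steps might look like the following minimal sketch; the rule names, status fields, and thresholds are assumptions for illustration, not values from the patent:

```python
# Each rule maps a predicate over the status snapshot to ordered
# remediation steps. First matching rule wins.
RULES = [
    {"name": "service_down",
     "when": lambda s: s.get("heartbeat_missed", 0) >= 3,
     "steps": ["notify_oncall", "restart_service", "verify_health"]},
    {"name": "bad_patch",
     "when": lambda s: s.get("error_rate", 0.0) > 0.2 and s.get("recent_patch"),
     "steps": ["notify_oncall", "rollback_patch", "verify_health"]},
]

def plan(status):
    """Return (rule_name, steps) for the first matching rule — the
    'standardized operation steps' the engine is said to generate."""
    for rule in RULES:
        if rule["when"](status):
            return rule["name"], rule["steps"]
    return "no_action", []

name, steps = plan({"error_rate": 0.35, "recent_patch": "p-2024-12"})
print(name, steps)
```

The returned step list is what the automation platform would render into an executable handling script.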
CN202411992030.5A 2024-12-31 2024-12-31 A method for updating an operating system based on cloud services Pending CN120085885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411992030.5A CN120085885A (en) 2024-12-31 2024-12-31 A method for updating an operating system based on cloud services

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411992030.5A CN120085885A (en) 2024-12-31 2024-12-31 A method for updating an operating system based on cloud services

Publications (1)

Publication Number Publication Date
CN120085885A true CN120085885A (en) 2025-06-03

Family

ID=95856664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411992030.5A Pending CN120085885A (en) 2024-12-31 2024-12-31 A method for updating an operating system based on cloud services

Country Status (1)

Country Link
CN (1) CN120085885A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120315752A (en) * 2025-06-12 2025-07-15 苏州元脑智能科技有限公司 Server firmware version management method and program product
CN120315752B (en) * 2025-06-12 2025-08-22 苏州元脑智能科技有限公司 Server firmware version management method and program product


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication