
CN120085885A - A method for updating an operating system based on cloud services - Google Patents


Info

Publication number
CN120085885A
Authority
CN
China
Prior art keywords
log
patch
information
abnormal
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411992030.5A
Other languages
Chinese (zh)
Inventor
韦宗星
欧东
吴珍兴
潘杨华
李新彰
吴大明
李滔滔
杨昌富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Gongcheng Yunwang Digital Intelligence Industry Development Co ltd
Original Assignee
Guizhou Gongcheng Yunwang Digital Intelligence Industry Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Gongcheng Yunwang Digital Intelligence Industry Development Co ltd
Priority to CN202411992030.5A
Publication of CN120085885A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/60 Software deployment
    • G06F8/65 Updates
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45591 Monitoring or debugging support

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides an operating system updating method based on cloud services. The method comprises: acquiring the operating system versions and patch installation conditions of different hosts by scanning system information of cloud hosts, dividing the hosts into corresponding management groups according to operating system type and version number, and formulating a unified patch strategy and update plan for each group; deploying monitoring agents on the cloud hosts, collecting operation indexes and log information of each service process in real time, aggregating the collected data into a centralized log analysis platform, and analyzing and mining the logs through a batch computation framework to construct a service operation behavior model; and training an anomaly detection model on the data in the log analysis platform using a machine learning algorithm in a supervised learning mode, wherein the model can automatically discover various abnormal behaviors in the service operation process from massive log data and generate alarm information.

Description

Cloud service-based operating system updating method
Technical Field
The invention relates to the technical field of information, in particular to an operating system updating method based on cloud services.
Background
Problem background:
In the data center of the government network, the operating system versions of the cloud hosts are uneven: some run Windows Server 2008, some Windows Server 2012, and some Windows Server 2016. The patches required by different operating systems are not identical. How to build a unified patch management mechanism so that operating systems of different versions obtain the latest security patches in time is a difficult problem for operation and maintenance personnel.
Furthermore, because the network environment of the government network is relatively closed, the cloud hosts cannot directly connect to Microsoft's official update server. Therefore, it is urgent to build a mirror update server inside the government network. However, the update files required by different operating system versions differ considerably, and building a set of automatically classified, intelligently identified mirror sources so that different operating systems can find their corresponding patch files requires operation and maintenance personnel to study the internal structure of the operating systems deeply and find the commonalities and differences among the various versions.
Meanwhile, the various services running on the cloud hosts also need real-time monitoring by operation and maintenance personnel. Different services have different log formats and operation characteristics, and operation and maintenance personnel need to develop an intelligent log analysis system that can accurately discover abnormal behaviors of various services and give early warning in time. This in turn requires the operator to have a deep understanding of the business logic of the various services, knowing which behaviors are normal and which are abnormal. In summary, in the operation and maintenance work of the government network, operation and maintenance personnel need to understand the technical details of the operating systems and services deeply, find commonalities across heterogeneous environments, and establish an automated, intelligent operation and maintenance mechanism to safeguard the safe and stable operation of the government network.
Disclosure of Invention
The invention provides an operating system updating method based on cloud service, which mainly comprises the following steps:
Acquiring operating system versions and patch installation conditions of different hosts by scanning system information of cloud hosts, dividing the hosts into corresponding management groups according to operating system type and version number, and formulating a unified patch strategy and an update plan for each group;
Setting up a mirror image server in the government network, downloading security update files required by versions of each operating system from a trusted patch source, analyzing key attributes such as applicable systems, version numbers and the like of patches by extracting metadata information of update packages, and automatically classifying the patch files into corresponding catalogues by adopting a clustering algorithm of machine learning;
Aiming at patch files in a mirror image server, key features are extracted, an intelligent classification model is trained, when a new patch file is added, the model can accurately identify the application range of the patch, and the patch file is automatically placed in a corresponding directory, so that different operating system versions can find matched updated files;
Deploying a monitoring agent on a cloud host, collecting running indexes and log information of each service process in real time, collecting collected data to a centralized log analysis platform, analyzing the logs and mining the data through a batch calculation framework, and constructing a behavior model of service running;
Training an anomaly detection model on the data in the log analysis platform using a machine learning algorithm in a supervised learning mode, wherein the model can automatically discover various abnormal behaviors in the service running process from massive log data and generate alarm information;
When the monitoring system finds that a certain service is abnormal, automatically notifying the operation and maintenance personnel of the abnormal information according to a pre-configured alarm rule, triggering an automatic emergency handling process, and adopting different handling measures, such as restarting a service process or rolling back a patch version, according to the severity and influence range of the abnormality;
In the operation and maintenance knowledge base, the configuration parameters, log formats, common faults and other information of different operation systems and services are summarized and generalized to form a set of standardized operation and maintenance specifications and processes, and the knowledge is associated and inferred through knowledge graph technology to form an intelligent operation and maintenance decision support system, so that diagnosis and treatment suggestions are provided for operation and maintenance personnel.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
The invention discloses an intelligent cloud host patch management and anomaly detection method. By scanning the system information of the cloud hosts, the hosts are grouped and managed according to operating system type and version, and a unified patch strategy is formulated. A mirror server is set up in the government network, and a machine learning clustering algorithm automatically classifies the patch files. A deployed monitoring agent collects service operation data, and a machine-learning-trained anomaly detection model finds anomalies in massive logs and raises alarms. An operation and maintenance decision support system is constructed in combination with knowledge graph technology to provide diagnosis suggestions for operation and maintenance personnel. The method realizes intelligent classification and automatic updating of cloud host patches as well as real-time monitoring and rapid handling of service anomalies, improving the security and stability of the cloud platform and reducing operation and maintenance costs.
Drawings
Fig. 1 is a flowchart of an operating system updating method based on cloud service according to the present invention.
Fig. 2 is a schematic diagram of an operating system updating method based on cloud services according to the present invention.
Fig. 3 is a schematic diagram of a cloud service-based operating system updating method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
As shown in Figs. 1-3, the cloud service-based operating system updating method in this embodiment may specifically include:
S101, acquiring operating system versions and patch installation conditions of different hosts by scanning system information of cloud hosts, dividing the hosts into corresponding management groups according to operating system types and version numbers, and making a unified patch strategy and an update plan for each group.
The method remotely connects to the cloud host through an API (application programming interface) or the SSH (secure shell) protocol, executes system commands, and acquires system information such as the operating system type, version number, and installed patch list of the cloud host. According to the acquired cloud host system information, a decision tree algorithm, and preset operating system type and version number rules, the management group of each cloud host is determined. In the configuration management database, the field of the management group to which the cloud host belongs is updated according to the unique identifier of the cloud host, dividing the cloud hosts into corresponding groups. For each management group, operation and maintenance personnel formulate a unified system patch strategy according to the characteristics of the group's operating system, determining the patch list to be installed and the installation priority. The patch strategy is converted into an executable script, and automated operation and maintenance tools such as Ansible are adopted to issue patch installation tasks to all cloud hosts in the group in batches. During patch installation, the installation progress and result of each cloud host are obtained in real time and recorded in the system log to facilitate tracking and auditing. After patch installation, the system information of the cloud hosts is scanned again to obtain the latest patch installation condition and judge whether it is consistent with the patch strategy; if not, a differential update is performed to ensure that the cloud hosts in each group meet the requirements of the patch strategy.
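The scan-result parsing described above can be sketched as follows (a minimal illustration; the hostname, command output, and package names are hypothetical, and a real deployment would collect them over SSH rather than hard-code them):

```python
def parse_scan_output(hostname, uname_output, package_lines):
    """Build a host record from raw scan output: the text returned by
    `uname -a` plus installed-package lines from `rpm -qa`."""
    fields = uname_output.split()
    return {
        "host": hostname,
        "os_type": fields[0] if fields else "unknown",            # e.g. "Linux"
        "kernel": fields[2] if len(fields) > 2 else "unknown",    # kernel release
        "packages": set(package_lines),                           # for patch diffing
    }

record = parse_scan_output(
    "web01",
    "Linux web01 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 GNU/Linux",
    ["kernel-3.10.0-957.el7", "openssl-1.0.2k-16.el7"],
)
```

The resulting record carries exactly the fields the grouping and differential-update steps need.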
Illustratively, remotely connecting to a cloud host to acquire system information is the basis of cloud platform management. Taking a cloud platform as an example, the virtual machine list can be obtained through RESTful API calls, and then each virtual machine can be logged into via the SSH protocol to execute system commands. Common commands include "uname -a" to get the operating system type and version, and "rpm -qa" or "dpkg -l" to list installed packages. This information can be used for subsequent grouping and patch management. A decision tree algorithm can efficiently group cloud hosts. For example, the operating system type (Windows/Linux) may be determined first, then the specific distribution (CentOS/Ubuntu, etc.), and finally the major version number (e.g., CentOS 6/7). This hierarchical decision can quickly partition the cloud hosts into appropriate management groups. The decision tree has the advantages of clear rules and high execution efficiency, and makes subsequent maintenance and adjustment of grouping strategies convenient. A Configuration Management Database (CMDB) is the core of IT asset management. In the CMDB, each cloud host has a unique identifier, such as an asset number or instance ID. When the group to which a cloud host belongs changes, only the management group field of the record located by that identifier needs to be modified. This centralized management approach ensures consistency and traceability of asset information. Several factors are considered when formulating the patch policy. Taking a Linux system as an example, the patching policy for CentOS 7 may include installing all security updates, installing kernel patches of a specific version, and excluding certain patches that may affect services. The policy should also consider installation order, such as installing dependent packages first and then the primary patches.
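The hierarchical, decision-tree-style grouping described above can be sketched as a small decision function (the group-naming scheme and host list are assumptions for illustration):

```python
def assign_group(os_type, distro, version):
    """Hierarchical grouping: decide OS family first, then the
    distribution, then the major version number."""
    if os_type == "Windows":
        return f"win-{version.replace(' ', '').lower()}"
    if os_type == "Linux":
        major = version.split(".")[0]          # e.g. "7.6" -> "7"
        return f"{distro.lower()}-{major}"
    return "unclassified"

hosts = [
    ("db01", "Linux", "CentOS", "7.6"),
    ("app02", "Linux", "Ubuntu", "18.04"),
    ("ad01", "Windows", "Server", "Server 2016"),
]
groups = {h: assign_group(t, d, v) for h, t, d, v in hosts}
```

Each branch mirrors one level of the decision tree, which keeps the rules easy to audit and extend.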
Careful policy formulation can ensure the safety and stability of the system to the maximum extent. Converting patch policies into executable scripts is the key to automation. Taking Ansible as an example, a playbook in YAML format can be written to define the sequence of tasks to be performed. The command module is used to execute specific shell commands, and modules such as yum can install a specified software package. Ansible's advantage lies in its declarative syntax and idempotence: the same playbook can be run multiple times without disturbing system state. Monitoring patch installation progress in real time is critical for large-scale deployment. A callback can be added to Ansible tasks so that the execution result of each step is written into the log system in real time. A common log system such as the ELK stack (Elasticsearch, Logstash, Kibana) provides strong log aggregation and visualization capabilities, allowing operators to track the patch deployment status in real time. Verification after patch installation is a necessary step to ensure that the policy has been executed in place. The system information collection script can be executed again to compare the list of packages and version numbers before and after installation. If a discrepancy is found, a discrepancy report can be generated, and based on its content a decision can be made as to whether additional patch installation or rollback operations are required. This closed-loop management can effectively improve the accuracy and completeness of patch management. Automating the whole flow not only improves efficiency but also reduces the risk of human error. For example, manual grouping may result in some specially configured hosts being misclassified, while using algorithms ensures consistency of classification.
Automated patch deployment can also ensure that all hosts are updated according to the same flow and standard, avoiding inconsistent system configurations caused by human negligence. However, automation also presents new challenges, for example, how to handle failures that may occur during patch installation, such as network outages or insufficient disk space. This requires adding more error handling and retry mechanisms to the scripts. In addition, for critical business systems, it may be necessary to add a manual confirmation step to the automated process to balance efficiency and safety. In general, the automatic grouping and patch management method based on system information can greatly improve the management efficiency and consistency of a large-scale cloud environment. By converting human experience into algorithms and automated scripts, not only can repetitive work be reduced, but the precise execution of management policies can also be ensured. Meanwhile, a complete log recording and verification mechanism provides a basis for subsequent audit and optimization.
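The post-installation verification and differential-update check can be sketched as a set difference over package lists (a minimal illustration; the package names are hypothetical):

```python
def patch_diff_report(expected, installed):
    """Compare the package set a patch policy requires against what a
    rescan actually found, reporting gaps for differential update."""
    expected, installed = set(expected), set(installed)
    return {
        "missing": sorted(expected - installed),     # triggers differential update
        "unexpected": sorted(installed - expected),  # outside the policy
        "compliant": expected <= installed,
    }

report = patch_diff_report(
    expected=["kernel-3.10.0-1160.el7", "openssl-1.0.2k-25.el7"],
    installed=["kernel-3.10.0-1160.el7"],
)
```

A non-empty "missing" list is exactly the discrepancy report the text describes, from which an additional install or rollback decision follows.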
The method acquires cloud host operating system versions and patch installation conditions through scanning, divides the hosts into management groups according to system type and version number, formulates a unified patch strategy and update plan for each group, and determines the patch update scheme of each cloud host.
The cloud hosts are comprehensively scanned with a scanning tool to acquire the operating system type, version number, and installed patch information of each cloud host. According to the acquired operating system type and version number, the cloud hosts are automatically grouped by a clustering algorithm, so that the cloud hosts in each group have similar operating system characteristics. For each group, the patch installation condition of the cloud hosts in the group is analyzed, existing patch gaps and vulnerability risks are identified, and the priority of patch updates is determined according to the risk level. Combining the cloud host grouping information and patch priorities, a unified patch strategy and update plan is formulated for each group, defining patch update time nodes and specific operation steps. When formulating the patch strategy, the dependency relationships and compatibility among different patches are analyzed through an association rule mining algorithm to ensure the reasonableness and safety of patch updates. According to the unified patch strategy and update plan, a personalized patch update scheme is generated for each cloud host, explicitly listing the patches to be installed and the specific operation flow. An automated operation and maintenance tool is adopted to update and repair patches on the cloud hosts in batches according to the patch update scheme, and the stability and controllability of the patch update process are ensured through monitoring and rollback mechanisms.
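The dependency-aware installation ordering implied by the association-rule analysis above can be sketched with a topological sort (the patch names and dependency edges here are hypothetical):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency edges: each patch maps to the set of
# patches that must be installed before it.
deps = {
    "app-hotfix-42": {"dotnet-framework-4.8"},
    "security-rollup-07": {"servicing-stack-update"},
    "dotnet-framework-4.8": set(),
    "servicing-stack-update": set(),
}

# static_order() yields a valid installation sequence, or raises
# CycleError if the mined rules contradict each other.
order = list(TopologicalSorter(deps).static_order())
```

Running the sort before deployment catches conflicting rules early and yields a safe batch installation sequence.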
Illustratively, cloud host scanning is the basis for patch management. Cloud host information may be obtained comprehensively through specialized scanning tools such as Nessus or OpenVAS. For example, a scan result may show that a host runs CentOS 7.6 and has the kernel-3.10.0-957.el7 patch installed but lacks the latest security updates. This detailed information provides the basis for subsequent grouping and policy formulation. A clustering algorithm may automatically group similar hosts. For example, using the K-means algorithm with operating system type and version as features, hosts can be categorized into a Windows Server 2016 group, an Ubuntu 18.04 group, and so on. This grouping approach makes subsequent patch management more targeted and efficient. Patch analysis is a key step in identifying risk. By comparing installed patches with newly released ones, potential vulnerabilities can be discovered. If a Windows server group lacks the MS17-010 patch, the high-risk EternalBlue vulnerability is present and should be given the highest update priority. This risk-based prioritization ensures that the most critical security issues are resolved in time. Formulating a patch policy requires comprehensive consideration. In addition to security, business continuity is also a concern. For example, critical business systems may need to be updated during off-peak hours with rollback time reserved. In particular, database servers may schedule patch updates between 2:00 and 4:00 a.m. on weekends, leaving a 2-hour observation period. Such careful planning minimizes the impact on business. Analyzing the dependencies between patches is critical. Association rule mining can reveal, for example, that .NET Framework updates must precede certain application patch installations. Such analysis can avoid system instability caused by patch conflicts and improve the update success rate. The personalized patch scheme considers the characteristics of each host.
For example, for Web servers, in addition to conventional system patches, special attention is required for security updates of Apache or Nginx. For database servers, patches for Oracle or MySQL are important considerations. This customized scheme ensures that each host gets the most appropriate updates. Automated tools such as Ansible greatly improve patch update efficiency. Batch, orderly patch installation can be realized by writing a playbook. For example, non-critical servers can be updated first, and core business servers updated after a period of observation. Meanwhile, by setting checkpoints and rollback mechanisms, if a certain patch is found to destabilize the system, it can be quickly restored to the pre-update state, ensuring the controllability of the whole process. This series of steps forms a complete patch management closed loop. From scanning, analysis, and strategy formulation to execution and monitoring, each link is carefully designed and mutually supporting. This systematic method not only improves the security of the cloud hosts but also optimizes the efficiency and reliability of the whole patch management flow. By continuously improving the process, the overall security level and operation and maintenance quality of the cloud environment can be continuously improved.
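The K-means host grouping mentioned above can be sketched in a few lines (a toy, stdlib-only implementation over one-hot encoded OS features, with deterministic seeding so the example is reproducible; a real system would typically use a library such as scikit-learn):

```python
def one_hot(value, vocabulary):
    """Encode a categorical value (e.g. an OS name) as a one-hot vector."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def kmeans(points, k, iterations=5):
    """Tiny k-means; centers seeded from the first k distinct points."""
    centers = []
    for p in points:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break

    def nearest(p):
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # Recompute each center as the mean of its cluster (keep old center if empty).
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return [nearest(p) for p in points]

oses = ["CentOS 7", "Ubuntu 18.04", "Windows Server 2016"]
hosts = [("db01", "CentOS 7"), ("db02", "CentOS 7"),
         ("web01", "Ubuntu 18.04"), ("ad01", "Windows Server 2016")]
labels = kmeans([one_hot(os_name, oses) for _, os_name in hosts], k=3)
```

With purely categorical features, hosts sharing an OS land in the same cluster, which is the management-group behavior the text describes.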
S102, setting up a mirror server in a government network, downloading security update files required by versions of each operating system from a trusted patch source, analyzing key attributes such as applicable systems, version numbers and the like of patches by extracting metadata information of update packages, and automatically classifying the patch files into corresponding catalogues by adopting a clustering algorithm of machine learning.
And obtaining available server resources in the government network, and determining the hardware configuration and network environment for building the mirror image server. And acquiring the security update files of each operating system version from the trusted patch source, and downloading the security update files to the appointed storage position of the mirror image server. And extracting metadata information of the downloaded patch file to obtain key attributes such as a suitable system and version number of the patch. And according to the extracted key attributes of the patch, adopting a K-means clustering algorithm to automatically classify the patch files according to the applicable system and version numbers. And judging the operating system and version of each patch file according to the classification result obtained by the clustering algorithm, and moving the operating system and version to the corresponding directory of the mirror image server. And configuring Web services such as Nginx and the like on the mirror server, providing a downloading service of the patch file, and ensuring that each system in the government network can access and download the required security update. Continuously monitoring the updating of the trusted patch source, automatically downloading the newly released patch file through a timing task, and repeatedly executing the classification and release processes to keep the patch library of the mirror server and the trusted source updated synchronously.
Illustratively, setting up a mirror server inside the government network is a key step for ensuring that systems are updated safely, timely, and reliably. First, the network environment and hardware resources need to be evaluated and a suitable server configuration selected. For example, a server with dual Intel Xeon processors, 128 GB of memory, 10 TB of storage space, and a gigabit network interface may be selected to meet the storage and high-concurrency download requirements of a large number of patch files. Acquiring security update files from a trusted patch source is the basis for ensuring patch reliability. Patch files may be synchronized from an official source, such as Windows Update or Red Hat Satellite, by means such as rsync. The downloaded patch files need to be preliminarily classified according to operating system type and version and stored under the designated directory of the mirror server. Extracting patch metadata information is key to achieving automated classification. A script can be used to parse the description information of a patch file and extract applicable attributes such as operating system, version number, and release date. For example, for Windows patches the XML description of a .msu file can be parsed, and for Linux patches the metadata of RPM or DEB packages can be parsed. Adopting a K-means clustering algorithm for automatic patch classification is an effective way to improve efficiency. The key attributes of the patch are first converted into a numerical vector; for example, the operating system type may be mapped to different values. The number of cluster centers is then set, such as the total number of Windows and Linux versions. Similar patches are aggregated into the same category by iterative computation.
This method can effectively process a large number of patch files and automatically identify new operating system versions. According to the clustering result, the patch files can be moved into the corresponding directory structure. For example, directories such as Windows/Server2016 and Linux/CentOS7 can be created, and the clustered patch files stored by category, facilitating subsequent management and downloading. Configuring the Web service is key to providing the patch download service for systems inside the government network. Nginx can be used as the Web server, with virtual hosts and access control configured so that only IP addresses inside the government network are allowed access. Meanwhile, caching and compression can be enabled to improve download efficiency. To ensure security, HTTPS may be configured to encrypt transmission using a self-signed certificate. Continuously monitoring patch sources and updating automatically is an important means of keeping the mirror server current. A timed task may be written to synchronize new patch files from the official source every morning. After synchronization completes, the patch classification and release process is automatically triggered, ensuring that new patches are provided to systems inside the government network in time. This automatic mechanism can greatly reduce manual intervention and improve the efficiency and accuracy of patch management. Through the above steps, an efficient and reliable patch mirror service can be built inside the government network, providing timely security updates for each system and effectively improving overall network security. The local mirror service not only accelerates patch distribution but also effectively controls the patch source and reduces dependence on external networks, and is an important safeguard for the secure operation and maintenance of the government network.
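The directory-routing step above can be sketched as a mapping from parsed metadata to a mirror path (the mirror root and metadata field names are assumptions for illustration):

```python
import pathlib

def route_patch(metadata, root="/srv/mirror"):
    """Map extracted patch metadata to a mirror directory, e.g.
    {'os': 'Windows', 'product': 'Server 2016'} -> .../Windows/Server2016."""
    product_dir = metadata["product"].replace(" ", "")  # "CentOS 7" -> "CentOS7"
    return str(pathlib.PurePosixPath(root) / metadata["os"] / product_dir)

linux_dest = route_patch({"os": "Linux", "product": "CentOS 7"})
windows_dest = route_patch({"os": "Windows", "product": "Server 2016"})
```

The scheduled sync job would call such a routing function for each newly classified patch before the Web service publishes it.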
S103, extracting key features from the patch files in the mirror server and training an intelligent classification model, so that when a new patch file is added, the application scope of the patch is accurately identified and the patch file is automatically placed into the corresponding catalogue, ensuring that each operating system version can find its matching update files.
Metadata of all patch files in the mirror server is acquired, and the file metadata is parsed to obtain file names, sizes, hash values, and associated-information lists, yielding a preliminary classification set of patch files for different operating systems. For each patch file in the preliminary classification set, the operating system field and version field are determined from its associated-information list, the operating system version information applicable to the patch file is extracted, and the patch file set is labeled according to the version information to obtain sample data. The sample data is parsed to generate a binary data stream, and numerical statistics such as byte frequencies are computed over the stream; when the occurrence frequencies of specific bytes conform to a normal distribution, a file-difference feature vector is constructed from the statistical feature values. A support vector machine algorithm is trained on the extracted file-difference feature vectors to obtain a supervised learning classifier; the computation iterates over the labeled data, the error between the predicted and labeled values keeps decreasing, and the accuracy of the training result finally converges. When a new patch file is obtained, its binary data stream is generated, numerical statistics are computed over the stream, and a file-difference feature vector for the new patch file is constructed according to the input feature requirements defined by the classifier. The file-difference feature vector of the new patch file is input into the supervised learning classifier to obtain the file-application-scope classification; from this classification result, the specific operating system version directory in which the patch is to be placed is determined.
According to the classification result, the newly added patch file is migrated to the specific classification directory on the server and the patch file index is updated; once the download list of patch files associated with the operating system is confirmed to be fully refreshed, all online users of the server are synchronized automatically.
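As a minimal sketch of the metadata-acquisition step (the function name and returned field set are assumptions), the file name, size, and SHA-256 hash described above can be collected as follows:

```python
import hashlib
import os

def patch_metadata(path):
    """Collect the basic metadata used for preliminary classification:
    file name, size in bytes, and a SHA-256 hash for integrity and uniqueness."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):  # stream large patch files
            sha256.update(chunk)
    return {
        "name": os.path.basename(path),
        "size": os.stat(path).st_size,
        "sha256": sha256.hexdigest(),
    }
```

The associated-information list (applicable OS, version number) would be read separately from the patch's description file and merged into this record.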
Illustratively, acquiring metadata of all patch files in the mirror server first requires traversing the server storage directory and reading the metadata of each patch file through the file system API. For example, basic information such as file name, size, and hash value can be obtained with the 'stat' command on a Linux system or the 'GetFileAttributesEx' function on a Windows system. The hash value can be computed with the SHA-256 algorithm, guaranteeing the integrity and uniqueness of the file. The associated-information list includes the operating system type, version number, and so on to which the patch applies; this information is typically stored in the patch file's metadata or in an attached description file. After parsing the file metadata, the patch files are preliminarily classified by the operating system field and version field. For example, patch files for Windows systems may contain fields such as "Windows10" and "Windows Server2016", and patch files for Linux systems may contain fields such as "Ubuntu20.04" and "CentOS7". This key information is extracted by regular expression matching or string parsing to form the preliminary classification set. On the basis of the preliminary classification set, the operating system version information applicable to each patch file is further extracted. For example, if a patch file's metadata contains "applicable to Windows10 version 1809", then "Windows10" and "1809" are extracted as key information. The patch file set is labeled with this information to form sample data, which may be represented in the form {(file name, size, hash value, operating system, version number)}. The sample data is parsed to generate a binary data stream, and the file's characteristics are analyzed by counting byte frequencies in the stream.
For example, counting the frequency of occurrence of each byte value in the file, if the frequency distribution of a specific byte accords with the normal distribution, the file difference feature vector can be constructed by using the statistical feature values. The feature vector may contain multiple dimensions, such as the number of times a high frequency byte occurs, the variance of the byte distribution, etc. And training a supervised learning classifier by adopting a Support Vector Machine (SVM) algorithm and combining the extracted file difference feature vectors. In the training process, the parameters of the classifier are continuously adjusted through repeated iterative computation, so that the error value between the predicted value and the marked value is continuously reduced. When the error value converges below a certain threshold, the training result is considered to reach a higher accuracy. For example, after 1000 iterations, the error value drops from 0.5 to 0.01, indicating that the classifier has learned the features of the sample data better. After the newly added patch file is obtained, a binary data stream is generated as well, and byte frequency statistics is carried out. And constructing a file difference feature vector of the new patch file according to the input feature value requirement defined by the classifier. For example, the byte frequency statistics of the new patch file show that the frequency of occurrence of a particular byte is highly similar to a certain class in the training sample, and the feature vector is input to the trained classifier. And determining an operating system and a version applicable to the new patch file according to the result output by the classifier. For example, if the output result of the classifier is "Windows10 version 1903", the patch file is migrated to the corresponding "Windows10/1903" directory in the server. 
Meanwhile, the patch file index is updated so that each system in the government network can obtain the latest patch information in time. Finally, it is checked whether the download list of patch files associated with the operating system has been fully refreshed; for example, by inspecting the cache update status of each system's patch management tool, it is ensured that all online users automatically synchronize the latest patch files. This method raises the automation level of patch management and safeguards the security and stability of systems inside the government network. Through the above steps, machine learning and statistical methods are used to classify patch files automatically and manage them efficiently, improving the security protection capability of the government network. The steps interlock, forming a rigorous logical chain that ensures the accuracy and timeliness of patch management.
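The byte-frequency statistics described above can be illustrated with a stdlib-only sketch; the feature layout (256 per-byte frequencies plus variance and top-byte share) is an assumption, and the resulting vector would then be fed to an SVM classifier such as scikit-learn's `SVC`:

```python
from collections import Counter

def byte_frequency_features(data: bytes):
    """Build a simple file-difference feature vector from a binary stream:
    per-byte frequencies plus summary statistics (variance, top-byte share)."""
    counts = Counter(data)
    total = len(data) or 1                      # guard against empty files
    freqs = [counts.get(b, 0) / total for b in range(256)]
    mean = sum(freqs) / 256
    variance = sum((f - mean) ** 2 for f in freqs) / 256
    top_byte_share = max(freqs)                 # share of the most frequent byte value
    return freqs + [variance, top_byte_share]
```

Vectors built this way for labeled patches form the training matrix; iterative SVM training then proceeds as the text describes.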
And S104, deploying a monitoring agent on the cloud host, collecting operation indexes and log information of each service process in real time, aggregating the collected data into a centralized log analysis platform, analyzing the logs and mining the data through a batch computation framework, and constructing a behavior model of service operation.
The deployed agent collects the running index values and log streams of each process in the cloud host's service process set and uploads the data. According to a preset log stream data format specification, the log analysis platform receives the data uploaded by the agent and classifies the log streams generated by different service processes, with each class assigned a different identifier. From the data content of each field in the log stream, the parser module in the analysis platform performs information extraction to obtain the log key information, the logs of the different service processes, and the identifiers of those processes. Anomaly detection is then performed per process using a pre-trained machine learning model: the log key information and identifiers are input into a classification model to obtain a normal or abnormal classification result for each process. The running index values of each process are acquired and assembled into a per-process index value set, which is input into a pre-trained regression model; the regression model's output yields a performance prediction result for each process. Based on the regression model output and the running index values, the computation module performs statistical analysis of the load state of each service process in the set, and a behavior sequence is obtained by comparing the process load under similar behaviors in the historical data. A behavior library is built from the behavior sequence and the current service process behavior; the behavior sequence is represented by a Markov model, which determines the next behavior of the service process.
The deployed agent collects the running index values and log streams of all service processes on the cloud host and uploads them to the log analysis platform. Assume a cloud host is running three service processes: a Web server, a database server, and a cache server. The agent monitors running indexes such as CPU utilization, memory usage, and network traffic of these processes in real time, and collects the log information they generate. The preset log stream data format specification is JSON, comprising fields such as timestamp, process ID, log level, and log content. For example, the Web server log may contain the URL and response time of each HTTP request, the database server log may contain the SQL query statement and execution time, and the cache server log may contain the cache hit rate and expiration times. After receiving the data uploaded by the agent, the log analysis platform first classifies the log streams generated by the different service processes. By parsing the process ID field in the log, the log streams are divided into three classes (Web server logs, database server logs, and cache server logs) and assigned different identifiers. Next, the parser module performs information extraction on each class of log: the URL and response time are extracted from the Web server logs, the SQL statement and execution time from the database server logs, and the cache hit rate and expiration time from the cache server logs. This key information is used for subsequent anomaly detection and performance prediction. A pre-trained machine learning model is adopted for anomaly detection.
And assuming that a decision tree-based classification model is used, inputting log key information and identification, and outputting normal or abnormal classification results of each process. For example, if the response time of a Web server suddenly increases, the model may mark it as abnormal, and if the SQL query execution time of the database server is abnormally long, the model may mark it as abnormal. And acquiring the running index value of each process, and constructing a running index value set of each process. For example, the set of running index values for the Web server may include CPU utilization, memory usage, network traffic, and the like. These index values are input into a pre-trained regression model to predict the performance of each process. Assuming a regression model based on random forests is used, the model predicts performance metrics over a period of time in the future based on historical data. And calculating the load state of each service process according to the regression model output result and the current operation index value. For example, by comparing the CPU utilization of the current Web server with the process load conditions of similar behavior in the history data, it can be derived whether the load state of the current Web server is high, medium or low. And comparing the process load conditions of similar behaviors in the historical data to obtain a behavior sequence. For example, the historical data shows that the query execution time of the database server increases significantly each time the CPU usage of the Web server exceeds 80%. This sequence of actions can be represented by a Markov model, predicting the next action of the service process. And constructing a behavior library according to the behavior sequence and the current service process behavior. 
For example, the CPU utilization of the current Web server has reached 85%, and it is predicted that the query execution time of the database server may increase according to the markov model in the behavior library, so that optimization measures, such as increasing buffering or optimizing SQL queries, are taken in advance. Through the steps, the log stream can be monitored and classified in real time, the performance and abnormal behavior of the process can be predicted, the intervention is performed in advance, and the stability and the high efficiency of the system are ensured. Such technical effects include improving system availability, reducing fault response time, optimizing resource allocation, and the like. For example, in practical application, when the response time of the Web server is detected to be abnormally increased, the system can automatically trigger the capacity expansion operation and increase the server instance, so that the load pressure is relieved, and when the SQL query execution time of the database server is abnormally long, the system can automatically optimize query sentences or increase indexes, and the query efficiency is improved. Through the multidimensional monitoring and predicting mechanism, the service process on the cloud host can run in an efficient and stable environment, and the overall performance and user experience of the system are greatly improved.
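The Markov-model behavior prediction above can be sketched as a first-order transition table; the class name and the behavior labels are illustrative:

```python
from collections import Counter, defaultdict

class BehaviorMarkov:
    """First-order Markov model over observed behavior sequences; predicts the
    most likely next behavior of a service process from its current behavior."""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def fit(self, sequences):
        # count transitions current -> next across all historical sequences
        for seq in sequences:
            for current, nxt in zip(seq, seq[1:]):
                self.transitions[current][nxt] += 1

    def predict_next(self, current):
        followers = self.transitions.get(current)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

    def probability(self, current, nxt):
        followers = self.transitions.get(current)
        if not followers:
            return 0.0
        return followers[nxt] / sum(followers.values())
```

A behavior library would store one such model per service process, updated as new sequences are observed, and a high-probability harmful transition would trigger the pre-emptive optimization measures described above.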
S105, training an anomaly detection model by using a machine learning algorithm aiming at the data in the log analysis platform in a supervised learning mode, wherein the model can automatically discover various anomaly behaviors in the service operation process from massive log data and generate alarm information.
A log dataset is constructed from the log content and should cover the various log entries in the historical data record, where normal and abnormal log entries carry data-tag marks. The labeled log dataset undergoes data preprocessing and feature extraction to obtain a numerical dataset, which is divided into a training set and a test set, with the sample proportions of the different service states in the training set balanced. A support vector machine model is trained on the training set to obtain an initial detection model; the test set is run through the initial detection model, and various indexes of the initial detection model are calculated against the test set detection results. Log content generated in the real environment is then acquired and preprocessed; the trained model of the previous time period is taken as the reference detection model, which processes the log data to generate preliminary prediction labels that are then adjudicated. In this adjudication, if a log record is preliminarily predicted abnormal, and its time and content match the log data of an abnormal record in the existing labeled database, the final prediction label is confirmed as abnormal. Abnormality judgment is performed on the final prediction label: if the final prediction label of a log is abnormal, the corresponding alarm level is determined, alarm schemes of different levels are applied, the final labeling result is recorded into the labeled database, and judgment continues; if a record's final label proves normal despite the preliminary abnormal prediction, this false alarm indicates a deficiency in the detection model.
The proportion of false alarms in the labeled database is checked; if it exceeds the set proportion, all abnormal-label samples in the labeled database are extracted for retraining to obtain a new time-period detection model, which is then redeployed.
Illustratively, constructing a log dataset from the log content first requires extracting all types of log entries, both normal and abnormal records, from the historical data. For example, in the order processing system of an e-commerce platform, normal logs may include entries such as "order creation success" and "payment completion", while abnormal logs may include "payment failure" and "stock shortage". Each log record needs a data tag such as "normal" or "abnormal". In the data preprocessing stage, the log data is cleaned and features are extracted. Given a log record "2023-10-01 10:00:00 user 123 payment failed", the features extracted after preprocessing may include the timestamp, user ID, operation type, and result status. These features are converted into numerical data: the timestamp becomes a Unix timestamp, and the operation type and result status are one-hot encoded. The preprocessed data is divided into a training set and a test set, ensuring balanced sample proportions of different service states in the training set; for example, a 1:1 ratio of normal to abnormal logs in the training set avoids biasing the model toward either class. Model training uses the Support Vector Machine (SVM) algorithm. Assuming the training set contains 10,000 log records, the SVM model establishes classification boundaries by learning the characteristics of these records. After training, the model is validated on the test set, and indexes such as accuracy, recall, and F1 score are calculated; for example, on a test set of 2,000 records the model may reach 90% accuracy, 85% recall, and an F1 score of 0.875. In the real environment, new log data is continuously collected and preprocessed. Supposing 1,000 new logs are collected on a given day, the previously trained reference detection model processes them to generate preliminary prediction labels.
If a log is preliminarily predicted abnormal and its time and content match an abnormal record in the labeled database, for example "2023-10-02 11:00:00 user 456 payment failed", the final prediction label is confirmed as abnormal. Abnormality judgment is performed on the final prediction label: if a log is finally predicted abnormal, the alarm level is judged according to its severity. For example, a payment failure may trigger a medium-level alarm, while a system crash triggers a high-level alarm. Alarm information notifies the relevant personnel via email, SMS, and other channels, and the final labeling result is recorded into the labeled database. If the proportion of false alarms in the labeled database exceeds the set threshold, say a false alarm rate of 20%, the model needs retraining: all abnormal-label samples are extracted from the database, the SVM model is retrained with the new data, and a new time-period detection model is generated and redeployed. When the behavior library is constructed, the behavior sequence is represented by a Markov model. Assuming the historical behavior sequence of a service process is "start-run-stop", the Markov model predicts the next behavior; if the current behavior is "running", the model predicts a higher probability that the next behavior is "stop". The advantage of this method is that through continuous learning and optimization the model identifies anomalies more accurately, reducing false alarms and missed alarms. Meanwhile, combining historical and real-time data gives a more complete picture of the service process's running state and improves the stability and reliability of the system. In practical application, constructing the log dataset and training the model is a dynamic process that requires continuous iteration and optimization.
By this method, anomalies can be discovered and handled in time, providing strong support for system optimization and upgrades. For example, analysis of the exception log may reveal that a certain interface reports errors frequently, allowing targeted optimization that improves system performance. In summary, through the steps of constructing a log dataset, preprocessing data, training and optimizing the model, judging anomalies, and alarming, the monitoring and management of service processes on the cloud host can be effectively improved, ensuring stable operation of the system.
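The preprocessing described above (timestamp to Unix time, one-hot encoding of operation type and result status) can be sketched as follows; the field names and category lists are assumptions for illustration:

```python
from datetime import datetime, timezone

OP_TYPES = ["order_create", "payment", "inventory"]   # assumed operation categories
STATUSES = ["success", "failure"]                      # assumed result statuses

def one_hot(value, categories):
    return [1.0 if value == c else 0.0 for c in categories]

def log_to_features(record):
    """Turn one parsed log record into a numeric vector: Unix timestamp,
    user id, one-hot operation type, one-hot result status."""
    ts = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S")
    ts = ts.replace(tzinfo=timezone.utc).timestamp()   # assume log times are UTC
    return [ts, float(record["user_id"])] \
        + one_hot(record["op"], OP_TYPES) \
        + one_hot(record["status"], STATUSES)
```

The resulting vectors, paired with their "normal"/"abnormal" tags, form the numerical dataset that the SVM is trained on.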
The method comprises the steps of obtaining an abnormal log by analyzing attributes such as log level and event time, judging the abnormal type according to the attributes such as request path and response state, training an abnormal detection model by adopting a random forest algorithm, and obtaining an abnormal detection result.
Log data generated by service operation is acquired, and each piece of log data is parsed for its log level and event time attributes; if the log level matches a predetermined abnormal level, the log is extracted as an abnormal log, yielding a number of abnormal log samples. For each abnormal log, the request path and response status attributes are parsed; the request path is matched against a pre-built path rule base to obtain matching information, and the response status is matched against a predetermined abnormal state table to obtain abnormal state information. If the matching information and the abnormal state each conform to some type, the abnormal type of the log is judged by combining the request path and response status attributes, yielding the abnormal type of each abnormal log. Abnormal-log features are then extracted, covering both textual numerical features and time-series features; the abnormal logs are divided by count into a training set and a test set, with one half of the training set's abnormal logs used for model construction and the other half for model training, yielding all training set data. The abnormal-log feature set of the training set is determined from all training set data generated during the data preprocessing operation, and a feature matrix is constructed for each abnormal log of the training set.
And aiming at feature matrixes constructed by the abnormal log feature sets of all training sets, calculating according to statistical values of the abnormal log features of each training set in different categories to obtain information gain rates of the abnormal log features of each training set, sequencing the abnormal log features of the training sets according to the information gain rates, sequentially inputting the sequenced feature matrixes into a constructed random forest model, training the random forest model, and carrying out cyclic iteration until convergence conditions are reached to obtain the random forest model for completing training. And obtaining each abnormal log of the test set during data preprocessing operation, determining each abnormal log feature set of the test set, and obtaining the statistical magnitude of each abnormal log of the test set in different categories to calculate and obtain the information gain rate of the abnormal log features of each test set. And sequencing the abnormal log characteristics of the test set according to the information gain rate. And obtaining the feature matrix of the ordered test set. And inputting the ordered abnormal log feature matrix of the test set according to the random forest model after training, performing forward operation to obtain a predicted output value, and comparing the label values of the test set to obtain the predicted accurate value of the abnormal log of each test set. And calculating an average value to obtain the model accuracy. And judging whether the next iteration is carried out according to the accuracy. And after the iteration is finished, obtaining an optimal detection model. And receiving an online log through an interface, analyzing log grade and time information contained in the log, and acquiring a log sample if the log grade is a preset abnormal grade. 
Analyzing the request path and the response state information of the log, carrying out path rule matching according to the request path to obtain matching information, carrying out state matching according to the response state to obtain state information, classifying the log into an abnormality if the matching information is matched with the state information by a preset abnormality type, and carrying out detection operation on the abnormality log according to an optimal detection model to obtain a detection result of whether the log is abnormal.
The log analysis platform performs anomaly detection through machine learning algorithms, which first requires acquiring the log data generated by service operation. Taking a Web server as an example, the log may include access time, request path, response status code, and similar information. When parsing log data, particular attention is paid to attributes such as log level and event time. For example, treating ERROR-level logs as abnormal samples allows potential problem logs to be screened out quickly. For further analysis of the abnormal logs, the request path and response status need to be parsed. Suppose a log shows the "/api/user" path returning a 500 status code; by matching against predefined path rules and the abnormal state table, it can be determined that this is a user-interface server error. This method locates problems quickly and helps handle faults in time. Feature extraction is a key step in model training. For text-based logs, numerical features can be extracted with methods such as TF-IDF; for time-series data, time-dependent features can be extracted with sliding-window techniques. For example, counting how often a particular error occurs within 30 minutes helps reveal periodic or bursty anomalies. In the model training phase, the random forest algorithm is widely used for its excellent performance and interpretability. Computing the information gain ratio to select the most discriminative features improves the efficiency and accuracy of the model: for example, if the feature "response time" has the highest information gain ratio, it is the most decisive for judging anomalies and should be prioritized. Model evaluation uses accuracy as an indicator, which helps measure the model's ability to identify anomalies.
Assuming 95% accuracy on the test set, this means that the model can accurately identify 95% of anomalies, but there is still 5% misjudgment and further optimization is required. Finally, the trained model is applied to real-time log analysis. When a new log is received, the system can quickly judge whether the log is abnormal. For example, if an API is detected to return an error frequently in a short time, the model may mark it as abnormal, and trigger an alarm mechanism, so that an operator can intervene in time. Compared with the traditional rule matching, the log analysis method based on machine learning has stronger adaptability and accuracy. The method can automatically learn complex abnormal modes, adapt to continuously-changing system environments and effectively improve the stability and reliability of services. Meanwhile, through continuous model updating and optimization, the abnormality detection capability of the system can be continuously improved, and long-term benefits are brought to IT operation and maintenance of enterprises.
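The information-gain-ratio ranking mentioned above can be illustrated for discrete features; this is a sketch of the standard gain-ratio formula, not the patent's exact computation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain_ratio(feature_values, labels):
    """Gain ratio of one discrete feature against class labels, used to rank
    abnormal-log features before feeding them to the random forest."""
    base = entropy(labels)
    total = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    conditional = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    split_info = entropy(feature_values)   # intrinsic information of the split
    if split_info == 0:
        return 0.0
    return (base - conditional) / split_info
```

Features are then sorted by this ratio, and the ordered feature matrix is fed to the random forest as the text describes.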
And S106, when the monitoring system finds that a certain service is abnormal, automatically notifying operation and maintenance personnel of the abnormality according to pre-configured alarm rules, triggering an automated emergency handling process, and taking different handling measures according to the severity and impact scope of the abnormality, such as restarting the service process or rolling back the patch version.
According to the log information of the collection server cluster, forming a log information stream through log structuring processing, identifying unstructured abnormal texts in the log information stream by using a natural language processing algorithm, comparing vectors corresponding to the abnormal texts with vectors in a pre-established feature library, calculating the matching degree, and triggering an alarm program if the matching degree exceeds a preset threshold value of 85%. And analyzing the matching degree sequence by using the time sequence prediction model, predicting the time point of the subsequent error log, if the time point sequence is dense, obtaining a service process fault conclusion, judging that the system reaches the severity level, acquiring historical state information as characteristic parameters, and transmitting the historical state information into a pre-constructed multi-mode fault prediction model to obtain an output result of the service to be restarted. Training a cyclic neural network model aiming at the fault information sequence, generating a fault investigation scheme set by predicting the next character of the information sequence, scoring the confidence degree of different steps in the fault investigation scheme set by comparing the actual service log content to obtain an optimal fault investigation strategy, and judging that the first step of the strategy needs to acquire a code management library modification record. Extracting a recent code modification record file and a historical version information file from a code management library, analyzing the difference between the files to obtain an associated file group, calculating the difference proportion before and after modification for each file in the associated file group, and marking the modification record file as a suspicious file if the difference proportion exceeds the limit. 
The suspicious file information is transmitted to a vulnerability scanning tool for file security assessment: an abstract syntax tree is created from the code syntax structure, each node of the abstract syntax tree is traversed and compared to obtain the number of vulnerabilities, and when the number of vulnerabilities exceeds a security threshold, the degree of association between the marked file and the vulnerabilities is determined. If the number of vulnerabilities exceeds a preset threshold of 8, a recently deployed patch information set is obtained and vectorized; a vector space model technique is used to represent the current version state, and the current and historical version states are passed through a version comparison tool to obtain a set of difference points, yielding the set of patch information files that need to be rolled back. A reverse-operation instruction set is constructed from the patch information files to be rolled back and the current version information; a cluster analysis algorithm performs clustering on the new log information vectors generated after the instructions are executed, and the risk level of the instruction set is judged by evaluating the compactness of the clustering result to determine whether to execute the reverse-operation instruction set.
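The abstract-syntax-tree traversal above can be illustrated with Python's `ast` module; the list of risky call names is a deliberately simplified stand-in for a real vulnerability pattern base:

```python
import ast

RISKY_CALLS = {"eval", "exec", "system"}   # hypothetical vulnerability patterns

def count_risky_nodes(source: str) -> int:
    """Walk the abstract syntax tree of a code file and count call nodes that
    match a (simplified) vulnerability pattern list."""
    tree = ast.parse(source)
    hits = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # plain calls carry .id (Name); attribute calls carry .attr
            name = getattr(func, "id", getattr(func, "attr", None))
            if name in RISKY_CALLS:
                hits += 1
    return hits
```

A file whose count exceeds the security threshold (8 in the text) would be flagged as strongly associated with the vulnerability.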
Illustratively, collecting server-cluster log information is the starting point for anomaly detection. Assume that the server cluster of an e-commerce platform generates several GB of log data every day, and these logs record detailed information such as user access, transaction processing, and database operations. Log structuring converts unstructured text into a structured data format such as JSON or CSV, which facilitates subsequent analysis. Unstructured abnormal text is identified using natural language processing (NLP) algorithms; for example, keywords in the log are extracted with the TF-IDF algorithm and the text is converted to a vector representation in conjunction with word-embedding techniques such as Word2Vec. Assuming a log record reads 'database connection failure', after NLP processing the vector corresponding to this text is compared with the 'database abnormal' vector in the feature library, cosine similarity is calculated, and if the similarity exceeds 85%, the alarm program is triggered. A time-series prediction model such as ARIMA or an LSTM network is used to analyze the matching-degree sequence. Assuming the matching degree of 'database connection failure' logs frequently exceeded the threshold over the past week, the model predicts that more similar errors may occur in the next 24 hours; if the predicted time points are dense, the service process is judged likely to fail and the system is deemed to have reached a severity level. Historical state information such as CPU utilization, memory occupancy, and network latency is acquired and fed as feature parameters into the pre-constructed multi-modal fault prediction model. The model may contain a hybrid structure of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), with the output suggesting a service restart to mitigate the failure.
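The matching-degree check described above can be sketched minimally in Python. This is an illustrative simplification: plain term-frequency vectors stand in for the TF-IDF/Word2Vec embeddings, and the texts are hypothetical; only the 0.85 threshold comes from the method.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over shared terms, normalized by the two vector magnitudes.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def match_degree(log_text: str, feature_text: str) -> float:
    # Bag-of-words term-frequency vectors stand in for the NLP embeddings.
    return cosine_similarity(Counter(log_text.split()),
                             Counter(feature_text.split()))

ALARM_THRESHOLD = 0.85  # the 85% matching threshold from the method

log_entry = "database connection failure timeout"
feature   = "database connection failure"          # entry from the feature library
score = match_degree(log_entry, feature)
triggered = score > ALARM_THRESHOLD                # alarm program would fire here
```

In this toy case the similarity is sqrt(3)/2 ≈ 0.866, just above the threshold, so the alarm would be triggered.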
A recurrent neural network model is trained on the fault information sequence, and a set of fault troubleshooting schemes is generated by predicting the next character of the sequence. For example, the model predicts 'check database connection configuration', the confidence of this step is scored against the actual service log content, and the best troubleshooting strategy is finally determined to be 'first acquire the code management library modification records'. Recent code modification record files and historical version information files are extracted from the code management library, and a file comparison tool (such as diff) is used to analyze the differences and obtain an associated file group. Assuming the difference ratio of a certain file before and after modification exceeds 30%, the file is marked as suspicious. The suspicious file information is passed to a vulnerability scanning tool (such as SonarQube), which builds an abstract syntax tree (AST), traverses its nodes, and obtains the vulnerability count through comparison operations. If the number of vulnerabilities exceeds 8, the marked file is determined to be highly associated with the vulnerabilities. A set of recently deployed patch information is obtained and vectorized, with vector space model (VSM) techniques representing the current version state. A version comparison tool (such as git diff) yields the set of difference points, from which the set of patch information files requiring rollback is obtained. A reverse operation instruction set is then constructed from the patch information files to be rolled back and the current version information.
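The 30% difference-ratio check used to mark suspicious files can be sketched with Python's standard difflib; the file contents below are hypothetical stand-ins for the code management library's records.

```python
import difflib

def diff_ratio(old: str, new: str) -> float:
    # Fraction of line content changed between two versions of a file
    # (1.0 means nothing in common, 0.0 means identical).
    sm = difflib.SequenceMatcher(None, old.splitlines(), new.splitlines())
    return 1.0 - sm.ratio()

SUSPICIOUS_THRESHOLD = 0.30  # the 30% difference ratio from the example

old_src = "def connect(db):\n    return db.open()\n"
new_src = (
    "def connect(db, retries=3):\n"
    "    for _ in range(retries):\n"
    "        if db.open():\n"
    "            return True\n"
    "    return False\n"
)
is_suspicious = diff_ratio(old_src, new_src) > SUSPICIOUS_THRESHOLD
```

Files whose ratio crosses the threshold would then be handed to the vulnerability scanner for AST-based assessment.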
A cluster analysis algorithm (such as K-means) is applied to the new log information vectors generated after the instructions are executed; the intra-cluster compactness of the clustering result is evaluated to judge the risk of the instruction set. If the intra-cluster compactness is high, indicating a low risk for the reverse operation, the instruction set may be executed. Through these steps, abnormal behaviors are automatically discovered in log data and alarm information is generated, while multi-level model analysis and operation instruction generation ensure system stability and security. This comprehensive anomaly detection and troubleshooting mechanism can effectively improve operation and maintenance efficiency and response speed, and reduce service interruption time caused by faults.
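The intra-cluster compactness evaluation can be illustrated as follows. This is a sketch under assumptions: the cluster assignments are taken as given (as if produced by K-means), the 2-D log vectors are invented for the example, and the risk threshold is hypothetical.

```python
import math

def centroid(points):
    # Component-wise mean of a list of equal-length numeric tuples.
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def compactness(clusters):
    # Mean Euclidean distance of members to their cluster centroid;
    # smaller values mean tighter clusters and, per the method, lower risk.
    total, count = 0.0, 0
    for pts in clusters:
        c = centroid(pts)
        for p in pts:
            total += math.dist(p, c)
            count += 1
    return total / count

LOW_RISK_LIMIT = 0.5  # hypothetical threshold for allowing the rollback

clusters = [
    [(0.10, 0.20), (0.15, 0.22), (0.12, 0.18)],  # log vectors after rollback
    [(0.90, 0.80), (0.88, 0.85)],
]
risk_ok = compactness(clusters) < LOW_RISK_LIMIT  # tight clusters => execute
```

With real K-means output the same idea corresponds to a low inertia (within-cluster sum of distances) relative to a calibrated baseline.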
S107, summarizing and organizing the configuration parameters, log formats, common faults, and other information of different operating systems and services in an operation and maintenance knowledge base to form a set of standardized operation and maintenance specifications and processes, and associating and reasoning over this knowledge through knowledge graph technology to form an intelligent operation and maintenance decision support system that provides diagnosis and handling suggestions for operation and maintenance personnel.
According to the operating system and service configuration information in the operation and maintenance knowledge base, association and reasoning are carried out through knowledge graph technology to obtain standardized configuration parameter templates, forming a configuration specification library. Natural language processing techniques are used to analyze and extract the log formats in the knowledge base, acquire key fields and event types, and build a log parsing model that automates log analysis. Common fault information in the knowledge base undergoes association analysis through knowledge graph technology to obtain the causes, impact scope, and solutions of faults, forming a fault diagnosis knowledge base. Based on the configuration specification library, the log parsing model, and the fault diagnosis knowledge base, a rule-based reasoning engine realizes automated decision-making for the operation and maintenance process and generates standardized operation steps. Historical operation and maintenance data are trained with machine learning algorithms to obtain classification and prediction models of operation and maintenance events, enabling early warning and advance handling of potential faults. The operation and maintenance decision support system is integrated with the monitoring platform to acquire the running-state data of systems and services in real time, judge whether anomalies exist through knowledge graph reasoning and machine learning models, and give diagnosis suggestions.
According to the diagnosis suggestions and standardized operation steps, automated handling scripts are generated and executed through the operation and maintenance automation platform, realizing quick fault repair and system recovery and improving operation and maintenance efficiency and system availability.
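The rule-based reasoning engine that maps monitored events to standardized operation steps can be sketched as a first-match rule table. Everything here (field names, metrics, thresholds, step texts) is hypothetical and illustrative only; the CPU-above-90% rule mirrors the example given later in the text.

```python
# Hypothetical rule set for the reasoning engine; each rule pairs a
# predicate on a monitoring event with standardized operation steps.
RULES = [
    {"when": lambda e: e["metric"] == "cpu" and e["value"] > 90,
     "steps": ["check process status", "analyze load sources"]},
    {"when": lambda e: e["metric"] == "disk" and e["value"] > 85,
     "steps": ["list large files", "rotate old logs"]},
]

def decide(event):
    # First matching rule yields the standardized operation steps;
    # unmatched events fall through to a human operator.
    for rule in RULES:
        if rule["when"](event):
            return rule["steps"]
    return ["escalate to operator"]

steps = decide({"metric": "cpu", "value": 95})
```

A production engine would load such rules from the configuration specification library rather than hard-coding them.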
Illustratively, the operation and maintenance knowledge base is the core of IT operation and maintenance management, and contains a great deal of configuration information, log format and fault handling experience. Through knowledge graph technology, the information can be structured and subjected to association analysis to form a standardized configuration specification library. For example, for the security configuration of the Linux server, key parameters such as the maximum login failure times, the password complexity requirements and the like can be extracted, and the association relationship between the key parameters is established. Thus, whether the configuration meets the specification can be rapidly checked, and potential safety risks can be deduced. Log parsing is an important link in operation and maintenance automation. Key information can be extracted from unstructured log text through natural language processing techniques. For example, for a log of "user login failure", fields such as a user name, an IP address, a failure cause, etc. may be identified. These structured log data provide the basis for subsequent analysis and decision making. The construction of the fault diagnosis knowledge base utilizes the reasoning capability of the knowledge graph. By analyzing historical fault cases, associations between fault symptoms, causes, and solutions can be established. For example, when a Web service is found to be slow in response, the system can infer possible reasons including exhaustion of a database connection pool, network congestion and the like according to the knowledge graph, and provide corresponding investigation steps. Rule-based reasoning engines are key to implementing automated decisions. The method combines configuration specifications, log analysis results and fault diagnosis knowledge to generate a standardized operation and maintenance operation flow. 
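The log-parsing step for the "user login failure" example can be sketched with a regular expression over named groups. The log line format here is a hypothetical one invented for illustration; real log formats would each need their own pattern in the log parsing model.

```python
import re

# Hypothetical key=value log format for the "user login failure" example.
LOG_PATTERN = re.compile(
    r"user=(?P<user>\S+)\s+ip=(?P<ip>\S+)\s+"
    r"result=login_failure\s+reason=(?P<reason>\S+)"
)

def parse_login_failure(line: str):
    # Extract the user name, IP address, and failure cause fields,
    # or None if the line is not a login-failure record.
    m = LOG_PATTERN.search(line)
    return m.groupdict() if m else None

record = parse_login_failure(
    "2024-12-01 user=alice ip=10.0.0.5 result=login_failure reason=bad_password"
)
```

The structured fields extracted this way are what feed the downstream analysis and decision steps.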
For example, when server CPU usage is detected to remain above 90%, the system can automatically generate a series of investigation steps, including checking process status and analyzing load sources. Machine learning algorithms play an important role in operation and maintenance early warning. By training on historical data, a prediction model for system anomalies can be established; for example, by analyzing past disk usage trends, the system can predict when disk space may be exhausted and issue an early warning ahead of time. Integrating the operation and maintenance decision support system with the monitoring platform enables real-time anomaly detection and diagnosis. For example, when the query latency of a database service is observed to increase suddenly, the system can immediately start a diagnosis flow, analyze whether problems such as slow queries or index failures exist, and give optimization suggestions. Finally, the generation and execution of automated handling scripts is critical to improving operation and maintenance efficiency. For example, when a service anomaly is diagnosed as caused by a memory leak, the system can automatically generate a script to restart the service and execute it through the operation and maintenance platform to achieve quick recovery. This not only reduces human intervention but also greatly shortens fault recovery time. Through this series of technologies and processes, an intelligent operation and maintenance system can be constructed that responds to and resolves problems quickly, predicts and prevents potential faults, and significantly improves system availability and stability.
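The disk-exhaustion early warning mentioned above can be sketched with a simple linear trend fit over daily usage percentages. This is a deliberate simplification of the time-series prediction models the text names (ARIMA/LSTM); the usage history is invented for the example.

```python
def days_until_full(usage_history, capacity=100.0):
    # Least-squares linear trend over daily usage percentages; returns the
    # estimated number of days until capacity is reached, or None if usage
    # is not growing. A toy stand-in for the ARIMA/LSTM models in the text.
    n = len(usage_history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage_history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage_history))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # usage flat or shrinking: no exhaustion predicted
    return (capacity - usage_history[-1]) / slope

eta = days_until_full([70.0, 72.0, 74.0, 76.0])  # growing 2%/day, 24% headroom
```

An alert rule would then fire when the estimate drops below a chosen lead time (e.g. 7 days).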
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process transformation made using the content of this specification, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A method for updating an operating system based on a cloud service, characterized in that the method comprises:
scanning the system information of cloud hosts to obtain the operating system version and patch installation status of different hosts, dividing the hosts into corresponding management groups according to operating system type and version number, and formulating a unified patch strategy and update plan for each group;
building a mirror server inside the government network to download the security update files required for each operating system version from a trusted patch source, extracting metadata information from the update packages, analyzing key attributes such as the applicable system and version number of each patch, and using a machine learning clustering algorithm to automatically classify the patch files into corresponding directories;
for the patch files in the mirror server, extracting key features and training an intelligent classification model, so that when a new patch file is added, the model can accurately identify the applicable scope of the patch and automatically place it into the corresponding directory, ensuring that different operating system versions can find matching update files;
deploying monitoring agents on the cloud hosts to collect the operating indicators and log information of each service process in real time, aggregating the collected data to a centralized log analysis platform, and parsing and mining the logs through a batch computing framework to build a behavioral model of service operation;
for the data in the log analysis platform, using machine learning algorithms to train, through supervised learning, an anomaly detection model that automatically discovers various abnormal behaviors during service operation from massive log data and generates alarm information;
when the monitoring system finds that a service is abnormal, automatically notifying the operation and maintenance personnel of the abnormality according to pre-configured alarm rules, and triggering an automated emergency response process in which different handling measures are taken according to the severity and impact scope of the abnormality, such as restarting the service process or rolling back the patch version;
in an operation and maintenance knowledge base, summarizing and organizing the configuration parameters, log formats, common faults, and other information of different operating systems and services to form a set of standardized operation and maintenance specifications and processes, and associating and reasoning over this knowledge through knowledge graph technology to form an intelligent operation and maintenance decision support system that provides diagnosis and handling suggestions for operation and maintenance personnel.
2. The method according to claim 1, characterized in that scanning the system information of the cloud hosts to obtain the operating system version and patch installation status of different hosts, dividing the hosts into corresponding management groups according to operating system type and version number, and formulating a unified patch strategy and update plan for each group comprises:
remotely connecting to the cloud host through an API interface or the SSH protocol, executing system commands, and obtaining system information such as the cloud host's operating system type, version number, and patch installation list;
based on the obtained cloud host system information, using a decision tree algorithm to determine the management group to which each cloud host belongs according to preset operating system type and version number rules;
in the configuration management database, updating the management group field of each cloud host according to its unique identifier, thereby dividing the cloud host into the corresponding group;
for each management group, the operation and maintenance personnel formulating a unified system patch strategy based on the operating system characteristics of the group, and determining the list of patches to be installed and the installation priority;
converting the patch strategy into an executable script program, and using automated operation and maintenance tools such as Ansible to issue patch installation tasks in batches to all cloud hosts in the group;
during the patch installation process, obtaining the installation progress and result of each cloud host in real time and recording them in the system log for tracking and auditing;
after the patch installation is completed, scanning the system information of the cloud hosts again to obtain the latest patch installation status and determine whether it is consistent with the patch strategy, and if not, performing differential updates to ensure that the cloud hosts in each group meet the requirements of the patch strategy;
further comprising: obtaining the cloud host operating system version and patch installation status through scanning, dividing management groups according to system type and version number, formulating a unified patch strategy and update plan for the groups, and determining the patch update scheme for each cloud host.
3. The method according to claim 2, characterized in that obtaining the cloud host operating system version and patch installation status through scanning, dividing management groups according to system type and version number, formulating a unified patch strategy and update plan for the groups, and determining the patch update scheme for each cloud host comprises:
performing a comprehensive scan of the cloud hosts with a scanning tool to obtain the operating system type, version number, and installed patch information of each cloud host;
based on the obtained operating system type and version number, using a clustering algorithm to automatically group the cloud hosts so that the cloud hosts in each group have similar operating system characteristics;
for each group, analyzing the patch installation status of the cloud hosts in the group, identifying missing patches and vulnerability risks, and determining the priority of patch updates according to the risk level;
combining the cloud host grouping information and patch priorities to formulate a unified patch strategy and update plan for each group, specifying the time nodes and concrete operation steps for patch updates;
when formulating the patch strategy, using an association rule mining algorithm to analyze the dependencies and compatibility between different patches to ensure the rationality and security of patch updates;
according to the unified patch strategy and update plan, generating a personalized patch update scheme for each cloud host, specifying the list of patches to be installed and the concrete operation process;
using automated operation and maintenance tools to update and repair the cloud hosts in batches according to the patch update scheme, and ensuring a smooth and controllable patch update process through monitoring and rollback mechanisms.
4. The method according to claim 1, characterized in that building a mirror server inside the government network, downloading the security update files required for each operating system version from a trusted patch source, extracting metadata information from the update packages, analyzing key attributes such as the applicable system and version number of each patch, and using a machine learning clustering algorithm to automatically classify the patch files into corresponding directories comprises:
obtaining the server resources available within the government network and determining the hardware configuration and network environment for building the mirror server;
obtaining the security update files for each operating system version from the trusted patch source and downloading them to the designated storage location of the mirror server;
extracting metadata information from the downloaded patch files to obtain key attributes such as the applicable system and version number of each patch;
based on the extracted key attributes, using the K-means clustering algorithm to automatically classify the patch files according to applicable system and version number;
according to the classification results of the clustering algorithm, determining the operating system and version to which each patch file belongs and moving it to the corresponding directory on the mirror server;
configuring Nginx or another Web service on the mirror server to provide a patch file download service, ensuring that all systems within the government network can access and download the required security updates;
continuously monitoring the trusted patch source for updates, automatically downloading newly released patch files through scheduled tasks, and repeating the above classification and publishing process to keep the mirror server's patch library synchronized with the trusted source.
5. The method according to claim 1, characterized in that extracting key features for the patch files in the mirror server and training an intelligent classification model, so that when a new patch file is added, the model can accurately identify the applicable scope of the patch and automatically place it into the corresponding directory, ensuring that different operating system versions can find matching update files, comprises:
obtaining the metadata of all patch files in the mirror server, parsing the file metadata to obtain the file name, size, hash value, and associated information list, and obtaining a preliminary classification set of patch files for different operating systems;
determining the operating system field and version field according to the associated information list of each patch file in the preliminary classification set, extracting the operating system version information applicable to the patch file, and annotating the patch file set according to the version information to obtain sample data;
parsing the sample data, generating binary data streams, and performing numerical statistics on the binary data streams.
6. The method according to claim 1, characterized in that deploying a monitoring agent on the cloud host to collect the operating indicators and log information of each service process in real time, aggregating the collected data to a centralized log analysis platform, and parsing and mining the logs through a batch computing framework to build a behavioral model of service operation comprises:
deploying the agent to collect the operating indicator values and log streams of each process in the cloud host's service process set, and uploading the data;
according to a preset log stream data format specification, the log analysis platform receiving the data uploaded by the agent and performing data classification on the log streams generated by different service processes, with different classifications receiving different identifiers;
through the data content of each field in the log stream, a parser module in the platform performing information extraction to obtain key log information and, for the logs of different service processes, determining the identifiers of the different service processes;
using a pre-trained machine learning model to perform anomaly detection on the different processes, inputting the key log information and identifiers into a binary classification model, and obtaining a normal or abnormal classification result for each process;
obtaining the operating indicator values of each process, building an operating indicator value set for each process, inputting the set into a pre-trained regression model, determining the regression model output, and obtaining a performance prediction result for each process;
according to the regression model output and the operating indicator values, a computing module performing statistical analysis on the load status of each service process in the service process set, and obtaining a behavior sequence by comparing the load of processes with similar behaviors in historical data;
building a behavior library from the behavior sequence and the current service process behavior, the behavior sequence being represented by a Markov model to determine the next behavior of the service process.
7. The method according to claim 1, characterized in that using machine learning algorithms on the data in the log analysis platform to train, through supervised learning, an anomaly detection model that automatically discovers various abnormal behaviors during service operation from massive log data and generates alarm information comprises:
constructing a log dataset from the log content, the dataset covering the various log entries of historical data records, in which normal and abnormal log entries carry data labels;
performing data preprocessing on the labeled log dataset, extracting features, obtaining a numerical dataset, and dividing it into a training set and a test set, with the proportions of samples of different service states in the training set kept balanced;
通过训练集对支持向量机算法的模型训练,得到初始检测模型,通过初始检测模型对测试集进行测试,计算出该初始检测模型针对测试集检测结果的各项指标;The support vector machine algorithm is trained through the training set to obtain an initial detection model, the test set is tested through the initial detection model, and various indicators of the initial detection model for the test set detection results are calculated; 获取真实环境中生成的日志内容并进行数据预处理,并获取之前时间序列的已训练模型作为基准检测模型,采用基准检测模型处理这些日志数据,生成初步预测标签,然后进行标签判定;Obtain the log content generated in the real environment and perform data preprocessing, and obtain the previously trained model of the time series as the benchmark detection model. Use the benchmark detection model to process these log data, generate preliminary prediction labels, and then perform label determination; 根据初步预测标签进行标签判定,针对初步预测标签,若某一条日志记录初步预测标签为异常,同时该条日志记录的时间和日志内容与已有标记数据库中某一条异常记录日志数据符合匹配条件,则确定最终预测标签为异常;The label is determined based on the preliminary predicted label. For the preliminary predicted label, if the preliminary predicted label of a log record is abnormal, and the time and log content of the log record meet the matching conditions with the log data of an abnormal record in the existing label database, the final predicted label is determined to be abnormal. 针对最终预测标签进行异常判定,若某一条日志最终预测标签为异常,则判断该条日志对应的告警级别,采用不同等级的告警方案进行告警,记录最终标记结果到已标记数据库中,然后继续执行判定,如果确定最终预测标签都为正常,则判定该检测模型存在缺陷;Perform anomaly determination on the final predicted labels. If the final predicted label of a log is abnormal, determine the alarm level corresponding to the log, use different levels of alarm schemes to issue an alarm, record the final labeling results in the labeled database, and then continue to perform the determination. If it is determined that the final predicted labels are all normal, it is determined that the detection model has defects. 判断已标记数据库中误报数量比例,若超过设定比例,则将已标记数据库中所有异常标签样本取出进行重新训练,获得新的时间序列的检测模型,然后重新部署模型;Determine the ratio of false positives in the labeled database. 
If it exceeds the set ratio, take out all abnormal label samples in the labeled database for retraining to obtain a new time series detection model, and then redeploy the model; 还包括:通过分析日志级别、事件时间等属性获取异常日志,根据请求路径、响应状态等属性判断异常类型,采用随机森林算法训练异常检测模型,得到异常检测结果。It also includes: obtaining abnormal logs by analyzing attributes such as log level and event time, judging the abnormal type according to attributes such as request path and response status, and using random forest algorithm to train the abnormal detection model to obtain the abnormal detection results. 8.根据权利要求7所述的方法,其特征在于,所述通过分析日志级别、事件时间等属性获取异常日志,根据请求路径、响应状态等属性判断异常类型,采用随机森林算法训练异常检测模型,得到异常检测结果,包括:8. The method according to claim 7 is characterized in that the abnormal log is obtained by analyzing the attributes such as log level and event time, the abnormal type is judged according to the attributes such as request path and response status, and the abnormal detection model is trained by random forest algorithm to obtain the abnormal detection result, including: 获取服务运行产生的日志数据,解析每条日志数据包含日志等级、事件时间属性信息,如果日志等级符合预先确定的异常等级,则提取该日志作为异常日志,得到多个异常日志样本;Obtain the log data generated by the service operation, parse each log data to include the log level and event time attribute information, and if the log level meets the predetermined abnormal level, extract the log as an abnormal log to obtain multiple abnormal log samples; 获取每个异常日志,解析每个异常日志的请求路径、响应状态属性信息,根据请求路径与预先构建的路径规则库进行匹配得到匹配信息,根据响应状态与预先确定的异常状态表进行匹配得到异常状态信息,如果匹配信息符合某个类型且异常状态符合某个类型,则综合两者判断异常类型,得到每个异常日志的异常类型;Obtain each exception log, parse the request path and response status attribute information of each exception log, match the request path with the pre-built path rule library to obtain matching information, match the response status with the pre-determined exception status table to obtain exception status information, and if the matching information matches a certain type and the exception status matches a certain type, then combine the two to determine the exception type 
and obtain the exception type of each exception log; 提取异常日志特征,特征提取包含异常日志文本数值特征与异常日志时序特征,把异常日志按数量分为训练集和测试集,训练集中一半异常日志用于构建模型,另一半异常日志用于模型训练,得到全部训练集数据;Extract abnormal log features. Feature extraction includes abnormal log text numerical features and abnormal log time series features. The abnormal logs are divided into training set and test set according to the number. Half of the abnormal logs in the training set are used to build the model, and the other half are used for model training to obtain all the training set data. 根据数据预处理操作时生成的全部训练集数据,确定训练集的异常日志特征集合;Determine the abnormal log feature set of the training set according to all training set data generated during the data preprocessing operation; 构建训练集的每个异常日志的特征矩阵;Construct the feature matrix of each abnormal log in the training set; 针对全部训练集的异常日志特征集合构建的特征矩阵,根据每个训练集异常日志特征在不同类别的统计量值计算得到各个训练集异常日志特征的信息增益率,按信息增益率对训练集异常日志特征排序,将排序后的特征矩阵依次输入至构建的随机森林模型,训练随机森林模型,循环迭代直至达到收敛条件后,得到完成训练的随机森林模型;A feature matrix is constructed for the abnormal log feature set of all training sets. The information gain rate of each abnormal log feature of the training set is calculated according to the statistical value of each abnormal log feature of the training set in different categories. The abnormal log features of the training set are sorted according to the information gain rate. The sorted feature matrix is input into the constructed random forest model in sequence. The random forest model is trained and iterated until the convergence condition is reached to obtain a trained random forest model. 
obtaining each abnormal log of the test set from the data preprocessing stage, determining the feature set of each test-set abnormal log, and computing the information gain ratio of each test-set abnormal log feature from its statistics across the different classes; sorting the test-set abnormal log features by information gain ratio; obtaining the sorted test-set feature matrix; feeding the sorted test-set feature matrix into the trained random forest model, performing a forward pass to obtain predicted outputs, and comparing them with the test-set labels to obtain the prediction precision for each test-set abnormal log; averaging these values to obtain the model accuracy; deciding from the accuracy whether to run the next iteration; obtaining the optimal detection model once iteration ends; receiving online logs through an interface and parsing the log level and time information they contain; if the log level equals the preset abnormal level, capturing the log sample; parsing the request path and response status of the log, matching path rules against the request path to obtain matching information, and matching the response status to obtain status information; if the matching information and status information match a preset abnormality type, the log is classified as abnormal.
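The information-gain-ratio ranking that claim 8 applies before training the random forest follows the standard entropy formulas. Below is a minimal stdlib sketch of that ranking step only; the feature names, sample data, and discrete-feature assumption are illustrative and not taken from the patent, and a production forest would typically come from a library rather than be hand-built:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_ratio(values, labels):
    """Information gain ratio of one discrete feature column vs. the labels."""
    n = len(labels)
    base = entropy(labels)
    groups = {}  # group labels by feature value
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = base - cond
    split = entropy(values)  # intrinsic information of the split
    return gain / split if split > 0 else 0.0

def rank_features(matrix, labels, names):
    """Sort feature names by descending information gain ratio,
    mirroring the ranking performed before the model is trained."""
    cols = list(zip(*matrix))
    scored = [(name, info_gain_ratio(col, labels)) for name, col in zip(names, cols)]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# toy abnormal-log feature matrix: (status_class, path_class) per log
X = [("5xx", "api"), ("5xx", "api"), ("4xx", "static"), ("4xx", "static"), ("5xx", "static")]
y = ["crash", "crash", "benign", "benign", "crash"]
ranking = rank_features(X, y, ["status_class", "path_class"])
print(ranking[0][0])  # status_class separates the two classes perfectly
```

The sorted feature order is what the claim feeds into the random forest; features with near-zero gain ratio could be dropped at this point.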
The detection operation is then performed on the abnormal log with the optimal detection model to obtain a detection result indicating whether the log is abnormal. 9. The method according to claim 1, characterized in that, when the monitoring system finds that a service is abnormal, automatically notifying operation and maintenance personnel of the abnormality according to pre-configured alarm rules while triggering an automated emergency handling process, and taking different handling measures, such as restarting the service process or rolling back the patch version, according to the severity and impact scope of the abnormality, comprises: structuring the collected server-cluster log information into a log information stream, identifying unstructured abnormal text in the stream with a natural language processing algorithm, comparing the vector of that abnormal text with the vectors in a pre-established feature library, and computing the degree of match; if the degree of match exceeds the preset threshold of 85%, triggering the alarm procedure; analyzing the sequence of match degrees with a time-series prediction model to predict the time points at which subsequent error logs will occur; if the sequence of time points is dense, concluding that the service process has failed and judging that the system has reached a severe level.
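The 85% match threshold in claim 9 presupposes some vector comparison; a common choice (not specified by the patent) is cosine similarity against the feature library. A minimal sketch, in which the feature library, its vectors, and the fault names are invented for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical feature library: known abnormal-text vectors, e.g. from a
# bag-of-words or embedding step that the claim leaves unspecified
FEATURE_LIBRARY = {
    "oom_kill": [0.9, 0.1, 0.0, 0.4],
    "disk_full": [0.0, 0.8, 0.6, 0.1],
}
ALARM_THRESHOLD = 0.85  # the 85 % match threshold from claim 9

def should_alarm(log_vector):
    """Compare an abnormal-text vector against the feature library and
    return (alarm_triggered, best_matching_fault)."""
    best_name, best_score = max(
        ((name, cosine(log_vector, vec)) for name, vec in FEATURE_LIBRARY.items()),
        key=lambda t: t[1],
    )
    return best_score > ALARM_THRESHOLD, best_name

triggered, match = should_alarm([0.88, 0.12, 0.02, 0.41])
print(triggered, match)  # True oom_kill
```

The sequence of match degrees produced over time by `should_alarm` is what the subsequent time-series prediction step would consume.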
The historical status information is then obtained as a feature parameter and passed into the pre-built multi-modal fault prediction model, which outputs that the service needs to be restarted; training a recurrent neural network model on the fault information sequence and generating a set of troubleshooting plans by predicting the next character of the information sequence; scoring the confidence of the different steps in that set against the actual service log content to obtain the best troubleshooting strategy, whose first step is determined to be retrieving the modification records of the code management repository; extracting the recent code modification record files and historical version information files from the code management repository, analyzing the differences between the files to obtain groups of associated files, and computing the before-and-after difference ratio for each file in each group; if the difference ratio exceeds the limit, marking the modification record file as suspicious; transferring the suspicious file information to a vulnerability scanning tool for a file security assessment, creating an abstract syntax tree from the code's syntactic structure, and obtaining the number of vulnerabilities by traversing and comparing the nodes of the abstract syntax tree; when the number is judged to exceed the security threshold, determining the degree of association between the marked file and the vulnerabilities;
if the number of vulnerabilities exceeds the preset threshold of 8, obtaining the set of recently deployed patch information, vectorizing that set, representing the current version state with a vector space model, and obtaining the set of difference points between the current and historical version states with a version comparison tool, thereby obtaining the set of patch information files to be rolled back; constructing a set of reverse operation instructions from the patch information files to be rolled back and the current version information, applying a clustering analysis algorithm to the vectors of the new log information produced after the instructions execute, and judging the risk level of the instruction set by evaluating the intra-cluster tightness of the clustering result, to determine whether to execute the set of reverse operation instructions. 10. The method according to claim 1, characterized in that, in the operation and maintenance knowledge base, summarizing information such as the configuration parameters, log formats, and common faults of different operating systems and services to form a set of standardized operation and maintenance specifications and processes,
and, through knowledge graph technology, associating and reasoning over this knowledge to form an intelligent operation and maintenance decision support system that provides diagnosis and handling suggestions for operation and maintenance personnel, comprises: associating and reasoning over the operating system and service configuration information in the operation and maintenance knowledge base with knowledge graph technology to obtain standardized configuration parameter templates, forming a configuration specification library; parsing and extracting the log formats in the operation and maintenance knowledge base with natural language processing technology, obtaining key fields and event types, and building a log parsing model to automate log analysis; performing association analysis on the common fault information in the operation and maintenance knowledge base with knowledge graph technology to obtain the cause, impact scope, and solution of each fault, forming a fault diagnosis knowledge base; based on the configuration specification library, the log parsing model, and the fault diagnosis knowledge base, using a rule-based reasoning engine to automate decisions in the operation and maintenance process and generate standardized operation steps; training on historical operation and maintenance data with machine learning algorithms to obtain classification and prediction models for operation and maintenance events, enabling early warning and advance handling of potential faults;
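The association-and-reasoning step over the fault knowledge graph can be illustrated with a toy triple store; the entities, relations, and two-hop chain below are invented for the sketch and a real system would use a graph database or ontology engine:

```python
# (subject, relation, object) triples — a minimal stand-in for the
# knowledge graph in claim 10; all names are illustrative only
TRIPLES = [
    ("high_cpu", "caused_by", "runaway_process"),
    ("runaway_process", "resolved_by", "restart_service"),
    ("disk_alert", "caused_by", "log_overflow"),
    ("log_overflow", "resolved_by", "rotate_logs"),
]

def follow(entity, relation):
    """All objects reachable from `entity` via `relation`."""
    return [o for s, r, o in TRIPLES if s == entity and r == relation]

def diagnose(symptom):
    """Chain caused_by -> resolved_by to derive (cause, remediation) pairs,
    mimicking the association-and-reasoning step over the graph."""
    suggestions = []
    for cause in follow(symptom, "caused_by"):
        for fix in follow(cause, "resolved_by"):
            suggestions.append((cause, fix))
    return suggestions

print(diagnose("high_cpu"))  # [('runaway_process', 'restart_service')]
```

Longer reasoning chains (fault → cause → affected service → rollback target) follow the same pattern of repeated `follow` calls.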
integrating the operation and maintenance decision support system with the monitoring platform to obtain the real-time running status of systems and services, judging whether an abnormality exists through knowledge graph reasoning and machine learning models, and giving diagnostic suggestions; generating automated handling scripts from the diagnostic suggestions and the standardized operation steps, and executing them through the operation and maintenance automation platform to achieve rapid fault repair and system recovery, improving operation and maintenance efficiency and system availability.
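The rule-based reasoning engine that turns a live status snapshot into standardized operation steps might look like the following minimal sketch; the rule names, status fields, and thresholds are assumptions for illustration, not values from the patent:

```python
# Each rule maps a predicate over the status snapshot to ordered
# remediation steps. First matching rule wins.
RULES = [
    {"name": "service_down",
     "when": lambda s: s.get("heartbeat_missed", 0) >= 3,
     "steps": ["notify_oncall", "restart_service", "verify_health"]},
    {"name": "bad_patch",
     "when": lambda s: s.get("error_rate", 0.0) > 0.2 and s.get("recent_patch"),
     "steps": ["notify_oncall", "rollback_patch", "verify_health"]},
]

def plan(status):
    """Return (rule_name, steps) for the first matching rule — the
    'standardized operation steps' the engine is said to generate."""
    for rule in RULES:
        if rule["when"](status):
            return rule["name"], rule["steps"]
    return "no_action", []

name, steps = plan({"error_rate": 0.35, "recent_patch": "p-2024-12"})
print(name, steps)
```

The returned step list is what the automation platform would render into an executable handling script.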
CN202411992030.5A 2024-12-31 2024-12-31 A method for updating an operating system based on cloud services Pending CN120085885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411992030.5A CN120085885A (en) 2024-12-31 2024-12-31 A method for updating an operating system based on cloud services

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411992030.5A CN120085885A (en) 2024-12-31 2024-12-31 A method for updating an operating system based on cloud services

Publications (1)

Publication Number Publication Date
CN120085885A true CN120085885A (en) 2025-06-03

Family

ID=95856664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411992030.5A Pending CN120085885A (en) 2024-12-31 2024-12-31 A method for updating an operating system based on cloud services

Country Status (1)

Country Link
CN (1) CN120085885A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120315752A (en) * 2025-06-12 2025-07-15 苏州元脑智能科技有限公司 Server firmware version management method and program product
CN120315752B (en) * 2025-06-12 2025-08-22 苏州元脑智能科技有限公司 Server firmware version management method and program product


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication