Disclosure of Invention
The embodiments of the present application provide a data acquisition method and system based on the big data of an intelligent operation and maintenance platform, which are used to solve the problems of low data acquisition accuracy and poor flexibility in the prior art.
In a first aspect, an embodiment of the present application provides a data acquisition method based on big data of an intelligent operation and maintenance platform, including:
acquiring configuration files of a plurality of data sources connected with an intelligent operation and maintenance platform, and screening target data sources from the plurality of data sources by using a multi-level rule engine based on the configuration files of the plurality of data sources to form a data source set;
predicting an optimal data acquisition mode corresponding to the data source set by using a target prediction model, wherein the target prediction model integrates a support vector machine, a target clustering algorithm and a long short-term memory network;
distributing, in combination with a distributed computing framework of the intelligent operation and maintenance platform, data acquisition tasks of the data source set to a plurality of devices in the intelligent operation and maintenance platform according to the optimal data acquisition mode, so that the plurality of devices acquire in parallel the original data collected by all data sources in the data source set;
and processing the original data, and determining an operation and maintenance result according to the processed data and operation and maintenance service requirement information.
Optionally, the predicting an optimal data acquisition mode corresponding to the data source set by using a target prediction model, wherein the target prediction model integrates a support vector machine, a target clustering algorithm and a long short-term memory network, comprises the following steps:
Analyzing the characteristics of each data source in the data source set by using a support vector machine and a target clustering algorithm based on the configuration file of each data source in the data source set to obtain an analysis result;
and predicting, according to the analysis result and by using a long short-term memory network, the optimal data acquisition mode corresponding to the data source set in combination with the enterprise operation and maintenance personalized demand information and the historical data acquisition modes.
Optionally, the analyzing, based on the configuration file of each data source in the data source set, the characteristics of each data source in the data source set by using a support vector machine and a target clustering algorithm to obtain an analysis result includes:
extracting first characteristic information of each data source in the data source set by adopting a metadata extraction technology based on configuration files of each data source in the data source set;
based on the first characteristic information of each data source in the data source set, classifying the class attribute of all the data sources in the data source set by adopting a support vector machine to obtain a data source classification result;
based on the data source classification result, clustering all data sources in each category by adopting a target clustering algorithm to obtain a data source clustering result;
based on the data source clustering result, adopting a principal component analysis technique to select partial characteristic information from the first characteristic information and taking the partial characteristic information as second characteristic information;
optimizing, based on the association relationship between the data sources, the selection of an initial center point of the target clustering algorithm in the clustering process by adopting a genetic algorithm so as to obtain an optimized data source clustering result;
based on the optimized data source clustering result, constructing a characteristic probability model of the data sources by adopting a Bayesian network so as to form a data source behavior prediction model;
and generating an analysis result based on the configuration file of each data source in the data source set, the first characteristic information of each data source, the data source classification result, the data source clustering result, the second characteristic information, the association relationship among the data sources, the optimized data source clustering result and the data source behavior prediction model.
Optionally, the target clustering algorithm comprises a K-means clustering algorithm, and the clustering, based on the data source classification result, of all data sources in each category by adopting the target clustering algorithm to obtain a data source clustering result comprises the following steps:
defining a characteristic quantization index system of the data sources, wherein the characteristic quantization index system comprises characteristic quantization indexes including data size, access delay, update frequency and security;
based on the data source classification result and the characteristic quantization index system, carrying out standard quantization processing on the first characteristic information of all the data sources in each category by adopting a standardization technique to obtain characteristic values of all the data sources in each category;
for each category, introducing a hierarchical clustering algorithm to determine the grouping number of the data sources, and taking the grouping number as the value of the initial input parameter of the K-means clustering algorithm;
selecting an initial center point in a probability-weighted mode, executing a K-means clustering process by adopting the K-means clustering algorithm based on the value of the initial input parameter and the initial center point, and introducing a dynamic adjustment mechanism of distance metrics in the clustering process to obtain a data source clustering result;
and evaluating the data source clustering result by using the silhouette coefficient method to obtain a silhouette coefficient value, and, in the case that the silhouette coefficient value is smaller than a preset threshold value, adjusting the value of the initial input parameter or optimizing the distance metric standard and repeatedly executing the K-means clustering process and the evaluation operation until the silhouette coefficient value is larger than or equal to the preset threshold value.
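The steps above can be sketched as a short Python loop. This is a minimal, self-contained illustration, not the claimed implementation: the two-dimensional characteristic points, the starting value of k, and the 0.5 threshold are all assumptions, and the initial centers are simply sampled rather than selected in a probability-weighted way (which a later step optimizes with a genetic algorithm).

```python
import math
import random

def kmeans(points, k, iters=50):
    """Plain K-means; initial centers are sampled here for brevity."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[idx].append(p)
        new = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return [c for c in clusters if c]

def silhouette(clusters):
    """Mean silhouette coefficient over all points; values lie in [-1, 1]."""
    if len(clusters) < 2:
        return -1.0
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            # mean distance to own cluster (p itself contributes 0)
            a = sum(math.dist(p, q) for q in cluster) / max(len(cluster) - 1, 1)
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return sum(scores) / len(scores)

random.seed(0)
# hypothetical quantized characteristic values (data size, access delay)
points = [(1 + random.random(), 1 + random.random()) for _ in range(10)] + \
         [(8 + random.random(), 8 + random.random()) for _ in range(10)]
k, threshold = 2, 0.5      # k would come from the hierarchical clustering step
while True:
    clusters = kmeans(points, k)
    score = silhouette(clusters)
    if score >= threshold:  # the evaluation step from the text
        break
    k += 1                  # adjust the initial input parameter and retry
```

The loop mirrors the claim: cluster, evaluate with the silhouette coefficient, and re-run with an adjusted parameter until the score reaches the preset threshold.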
Optionally, the introducing a dynamic adjustment mechanism of distance metrics in the clustering process to obtain a data source clustering result, wherein the dynamic adjustment mechanism of distance metrics indicates that different distance metric standards are adopted for data sources with different characteristic types, comprises the following steps:
constructing a distance metric library from a plurality of distance metrics, wherein the distance metric library comprises the Euclidean distance, the Manhattan distance, the cosine similarity distance and the dynamic time warping distance;
based on the initial characteristic classification result, selecting a distance metric corresponding to each classification in the initial characteristic classification result from the distance metric library by adopting an adaptive distance metric selection algorithm;
assigning corresponding weights to all characteristic quantization indexes in each category according to the characteristic importance, and dynamically adjusting the distance metric standard corresponding to each category according to the weights corresponding to all characteristic quantization indexes in that category to obtain the adjusted distance metric standard corresponding to each category;
and determining a data source clustering result based on the adjusted distance measurement standards corresponding to all the classifications.
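A compact sketch of such a metric library and the weight-based adjustment might look as follows; the metric names, importance weights, and category labels are illustrative assumptions, and dynamic time warping is omitted for brevity:

```python
import math

# distance metric library built from several metrics (DTW omitted for brevity)
METRIC_LIBRARY = {
    "euclidean": lambda a, b: math.dist(a, b),
    "manhattan": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    "cosine": lambda a, b: 1 - sum(x * y for x, y in zip(a, b))
              / ((math.hypot(*a) * math.hypot(*b)) or 1.0),
}

def weighted_metric(name, weights):
    """Dynamic adjustment: scale each characteristic quantization index by
    its importance weight before applying the selected base metric."""
    base = METRIC_LIBRARY[name]
    def metric(a, b):
        return base(tuple(w * x for w, x in zip(weights, a)),
                    tuple(w * x for w, x in zip(weights, b)))
    return metric

# hypothetical output of the adaptive selection algorithm, per classification
category_metric = {
    "numeric_logs": weighted_metric("euclidean", (0.7, 0.3)),
    "text_sources": weighted_metric("cosine", (0.5, 0.5)),
}
d = category_metric["numeric_logs"]((1.0, 2.0), (4.0, 6.0))
```

Each classification thus carries its own adjusted metric, which the clustering step can look up instead of using one fixed distance for all data sources.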
Optionally, optimizing the selection of the initial center point of the target clustering algorithm in the clustering process by adopting a genetic algorithm based on the association relationship between the data sources to obtain an optimized data source clustering result, including:
initializing a population to obtain an initial population, wherein each individual in the initial population represents the position of a center point to be determined of a group of data sources;
calculating the fitness value of each individual in the population based on a fitness function, wherein the population is the initial population in the first iteration and the updated population in subsequent iterations;
performing selection based on the fitness value of each individual to obtain an optimized population, and performing crossover and mutation operations on the individuals in the optimized population to obtain an updated population;
judging whether a preset iteration stopping condition is reached, and if not, re-executing the calculation step, the selection operation, the crossover operation, the mutation operation and the judgment step until the preset iteration stopping condition is reached, wherein the preset iteration stopping condition is that the maximum number of iterations is reached or the genetic algorithm converges.
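The genetic loop above can be sketched as follows, under the assumption that fitness is the sum of distances from each point to its nearest candidate center (lower is better); the population size, generation count, and mutation rate are illustrative:

```python
import math
import random

def within_cluster_cost(centers, points):
    """Fitness proxy: total distance from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

def ga_initial_centers(points, k, pop_size=20, generations=30, seed=1):
    rng = random.Random(seed)
    # each individual is one candidate set of k initial center positions
    population = [rng.sample(points, k) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda ind: within_cluster_cost(ind, points))
        survivors = population[:pop_size // 2]            # selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, k)                      # crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                         # mutation
                child[rng.randrange(k)] = rng.choice(points)
            children.append(child)
        population = survivors + children
    return min(population, key=lambda ind: within_cluster_cost(ind, points))

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(30)]
best = ga_initial_centers(pts, k=3)
```

Here the loop runs for a fixed number of generations; a convergence test on the best fitness value could replace it, matching the preset stopping condition in the text.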
Optionally, the predicting, according to the analysis result and in combination with the enterprise operation and maintenance personalized demand information and the historical data collection modes, the optimal data collection mode corresponding to the data source set by adopting a long short-term memory network comprises the following steps:
based on the analysis result, generating a comprehensive demand feature matrix by combining the enterprise operation and maintenance personalized demand information;
based on the comprehensive demand feature matrix, generating time sequence features of a historical acquisition mode by adopting a time sequence analysis method;
and predicting, based on the comprehensive demand feature matrix and the time sequence features of the historical acquisition modes, the optimal data acquisition mode corresponding to the data source set by adopting a long short-term memory network.
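For concreteness, a single step of the recurrent unit underlying a long short-term memory network can be written out for scalar features. The weights below are placeholders rather than learned values, the input sequence stands in for one row of the comprehensive demand feature matrix, and the mapping from the final hidden state to a concrete acquisition mode is left abstract:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM cell step for scalar input/state; W holds the gate weights
    (input, forget, output, candidate), each as (w_x, w_h, bias)."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h + W["i"][2])  # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h + W["f"][2])  # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h + W["o"][2])  # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h + W["g"][2])  # candidate
    c = f * c + i * g          # cell state carries long-term memory
    h = o * math.tanh(c)       # hidden state carries short-term memory
    return h, c

# toy weights; a trained model would learn these from historical modes
W = {k: (0.5, 0.5, 0.0) for k in ("i", "f", "o", "g")}
h = c = 0.0
# a row of the comprehensive demand feature matrix, fed as a time sequence
for x in (0.2, 0.4, 0.9, 0.1):
    h, c = lstm_step(x, h, c, W)
# h would then be mapped (e.g. through a linear layer and argmax) to one of
# the candidate acquisition modes such as API call, file read, or message queue
```

The gating structure is what gives the network its time sequence analysis capability: the forget and input gates decide how much of the historical acquisition pattern to retain at each step.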
In a second aspect, an embodiment of the present application provides a data acquisition system based on big data of an intelligent operation and maintenance platform, including:
an acquisition and screening module, configured to acquire configuration files of a plurality of data sources connected with the intelligent operation and maintenance platform, and to screen target data sources from the plurality of data sources by using a multi-level rule engine based on the configuration files of the plurality of data sources to form a data source set;
a prediction module, configured to predict an optimal data acquisition mode corresponding to the data source set by using a target prediction model, wherein the target prediction model integrates a support vector machine, a target clustering algorithm and a long short-term memory network;
a classification acquisition module, configured to distribute, in combination with a distributed computing framework of the intelligent operation and maintenance platform, data acquisition tasks of the data source set to a plurality of devices in the intelligent operation and maintenance platform according to the optimal data acquisition mode, so that the plurality of devices acquire in parallel the original data collected by all data sources in the data source set;
and a processing determining module, configured to process the original data and determine an operation and maintenance result according to the processed data and the operation and maintenance service requirement information.
In a third aspect, an embodiment of the present application provides a computing device, including a processing component and a storage component, where the storage component stores one or more computer instructions, and the one or more computer instructions are used to be invoked and executed by the processing component to implement a data collection method based on big data of an intelligent operation and maintenance platform according to any one of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer storage medium storing a computer program, where the computer program, when executed by a computer, implements the data collection method based on big data of an intelligent operation and maintenance platform according to any one of the first aspect.
The embodiments of the present application provide a data acquisition method based on the big data of an intelligent operation and maintenance platform, which comprises: acquiring configuration files of a plurality of data sources connected with the intelligent operation and maintenance platform, and screening target data sources from the plurality of data sources by using a multi-level rule engine based on the configuration files of the plurality of data sources to form a data source set; predicting an optimal data acquisition mode corresponding to the data source set by using a target prediction model, wherein the target prediction model integrates a support vector machine, a target clustering algorithm and a long short-term memory network; distributing, in combination with a distributed computing framework of the intelligent operation and maintenance platform, data acquisition tasks of the data source set to a plurality of devices in the intelligent operation and maintenance platform according to the optimal data acquisition mode, so that the plurality of devices acquire in parallel the original data collected by all data sources in the data source set; and processing the original data and determining an operation and maintenance result according to the processed data and the operation and maintenance service requirement information.
According to this embodiment, the multi-level rule engine can flexibly set multi-level rules according to different operation and maintenance requirements and scenarios, which ensures the accuracy and applicability of the screening results. The target prediction model in this embodiment combines a support vector machine, a target clustering algorithm and a long short-term memory network, so that the optimal data acquisition mode can be predicted intelligently, improving the accuracy and flexibility of data acquisition. Specifically, this embodiment uses the support vector machine to perform classification analysis on the characteristics of the data sources, so that different types of data sources, such as structured data, unstructured data, static data and dynamic data, can be accurately distinguished. This embodiment groups the data sources into different clusters according to their characteristics (such as data size and access delay) through the target clustering algorithm, which facilitates finer-grained characteristic analysis. This embodiment utilizes the time sequence analysis capability of the long short-term memory network to predict the optimal data acquisition mode in combination with the historical data acquisition modes and the enterprise operation and maintenance personalized demands, thereby providing a personalized data acquisition mode for the demands of different data sources and application scenarios.
These and other aspects of the application will be more readily apparent from the following description of the embodiments.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present application with reference to the accompanying drawings.
Some of the flows described in the specification, the claims and the foregoing figures of the present application include a plurality of operations appearing in a particular order, but it should be clearly understood that these operations may be performed out of the order in which they appear herein or in parallel. The sequence numbers of the operations, such as S11 and S12, are merely used to distinguish between the various operations, and the sequence numbers themselves do not represent any order of execution. In addition, the flows may include more or fewer operations, and these operations may be performed sequentially or in parallel. It should be noted that the terms "first" and "second" herein are used to distinguish different messages, devices, modules, etc.; they do not represent a sequence, and "first" and "second" are not limited to being of different types.
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
Fig. 1 is a flowchart of a data collection method based on big data of an intelligent operation and maintenance platform according to an embodiment of the present application, as shown in fig. 1, the method includes:
S11, acquiring configuration files of a plurality of data sources connected with the intelligent operation and maintenance platform, and screening target data sources from the plurality of data sources by using a multi-level rule engine based on the configuration files of the plurality of data sources to form a data source set.
It should be appreciated that the present embodiment utilizes an automation tool (e.g., a network scanning tool) of the intelligent operation and maintenance platform to automatically identify and configure various data sources by scanning the network environment, reading configuration files, and the like. The data sources may include databases, log files, network interfaces, and the like.
Specifically, the step S11 may include: step S111, scanning the network environment of the intelligent operation and maintenance platform by using a network scanning tool to obtain a scanning result comprising a data source list and its detailed information, where the detailed information includes an Internet Protocol (IP) address, a port number, a user name, a password, and the like; step S112, reading and parsing the existing configuration files by using a configuration file parser (such as a regular expression or a target parser), dynamically generating a new configuration file by using a template engine and a programming language (such as Python or Java) according to the parsed existing configuration information and the scanning result, and introducing a genetic algorithm into the generation process, where the configuration file includes a data source type, an IP address, a port number, authentication information, etc., and the target parser may be a JavaScript Object Notation (JSON) parser, an eXtensible Markup Language (XML) parser, or the like; and step S113, carrying out syntax and logic verification on the new configuration file by using a configuration file verification tool to obtain a verified configuration file.
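The syntax and logic verification of a generated configuration file can be sketched as follows; the JSON schema, the required field names, and the port-range check are assumptions for illustration, not the actual verification tool:

```python
import json

REQUIRED_KEYS = {"type", "ip", "port", "auth"}   # hypothetical schema

def validate_config(raw):
    """Syntax check (JSON parse) plus a logic check on required fields,
    mirroring the verification step for newly generated configuration files."""
    try:
        cfg = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, f"syntax error: {e.msg}"
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        return None, f"missing fields: {sorted(missing)}"
    if not 0 < int(cfg["port"]) < 65536:
        return None, "port out of range"
    return cfg, None

cfg, err = validate_config(
    '{"type": "mysql", "ip": "10.0.0.5", "port": 3306, "auth": "token"}')
```

A file that fails either check would be regenerated rather than passed to the screening stage.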
It should also be appreciated that the multi-level rule engine may be designed with a preset rule layer, a dynamic rule layer and a comprehensive rule layer. The preset rule layer may be built with a plurality of preset rules, including screening conditions for common data sources. The dynamic rule layer can intelligently and dynamically generate suitable screening rules by utilizing a machine learning algorithm according to historical data and user behavior data, so as to improve the accuracy and efficiency of screening. The comprehensive rule layer can combine the preset rules and the dynamically generated screening rules to form comprehensive screening rules, further improving screening accuracy and efficiency. For the dynamic rule layer, the intelligent operation and maintenance platform can intelligently recommend the most suitable screening rules according to the historical data and user behaviors by using a machine learning algorithm (such as a decision tree or a random forest). The historical data comprises screening records, screening results, user feedback and the like from a past time period; the user behavior data includes the user's screening habits, commonly used rules, preference settings, etc.
Optionally, a powerful multi-level rule engine is built in the intelligent operation and maintenance platform, and the data sources can be screened according to preset rules, comprehensive screening rules and the like to obtain a data source set. These rules may be based on various conditions of data type, data format, time stamp, keywords, etc. Illustratively, the data type is a specified database or a log file of a particular type. The data format is a log file with a preset format. The time stamp is the data source updated in the last week. Keywords are numerical values within a particular word or symbol or interval.
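A minimal sketch of such rule-based screening, with preset rules expressed as predicates over a data source's configuration (the field names, the JSON-format rule, and the one-week recency window are illustrative assumptions):

```python
from datetime import datetime, timedelta

# preset rule layer: screening conditions over a data source's configuration
PRESET_RULES = [
    lambda s: s["data_type"] in {"database", "log_file"},      # data type
    lambda s: s["format"] == "json",                           # data format
    lambda s: s["updated"] >= datetime.now() - timedelta(days=7),  # timestamp
]

def comprehensive_filter(sources, dynamic_rules=()):
    """Comprehensive rule layer: a source passes only if every preset rule
    and every (machine-learning-recommended) dynamic rule accepts it."""
    rules = list(PRESET_RULES) + list(dynamic_rules)
    return [s for s in sources if all(r(s) for r in rules)]

sources = [
    {"name": "orders_db", "data_type": "database", "format": "json",
     "updated": datetime.now()},
    {"name": "old_log", "data_type": "log_file", "format": "json",
     "updated": datetime.now() - timedelta(days=30)},
]
selected = comprehensive_filter(sources)
```

Dynamic rules recommended by the machine learning layer can be passed in as additional predicates without changing the preset layer.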
S12, predicting an optimal data acquisition mode corresponding to the data source set by using a target prediction model, wherein the target prediction model integrates a support vector machine, a target clustering algorithm and a long short-term memory network.
Specifically, this embodiment can utilize a support vector machine and a K-means clustering algorithm to perform a deep analysis of the characteristics of the data sources (such as data types, data volumes and network environments) based on the data source set, so as to obtain a detailed analysis report. Illustratively, the support vector machine is used to classify the data source types, ensuring that different types of data correspond to different processing modes; the K-means clustering algorithm is used to group the data sources to obtain a data source clustering result; and the long short-term memory network is used to predict the optimal data acquisition mode according to the data source clustering result.
S13, combining the distributed computing framework of the intelligent operation and maintenance platform, distributing the data acquisition tasks of the data source set to a plurality of devices in the intelligent operation and maintenance platform according to an optimal data acquisition mode, so that the plurality of devices can acquire the original data acquired by all the data sources in the data source set in parallel.
It should be appreciated that the intelligent operation and maintenance platform supports a variety of data access modes, including Application Programming Interface (API) calls, file reads, message queues, and the like. The intelligent operation and maintenance platform can automatically select the optimal data acquisition mode, ensuring high efficiency and reliability of data acquisition. The distributed computing framework of the intelligent operation and maintenance platform can be Apache Spark, Apache Hadoop, a similar distributed computing framework, or another type of framework; the structure of the framework is not particularly limited in this embodiment. For example, this step can use an Apache Spark distributed computing framework to allocate, in parallel and according to the optimal data collection mode predicted in the previous step, the data collection tasks for the data in the data sources to multiple devices of the intelligent operation and maintenance platform. Accordingly, by using the distributed computing framework, the intelligent operation and maintenance platform can process the data acquisition tasks of a plurality of data sources in parallel, improving the speed and concurrency of data acquisition.
Optionally, the embodiment can dynamically adjust the task allocation strategy according to the load condition and the network condition of each device by combining the self-adaptive task scheduling algorithm and the reinforcement learning technology aiming at the data acquisition task of the data source set to obtain the task scheduling result, and the task scheduling result can ensure that each data acquisition task can be efficiently executed on the most suitable device. In addition, the embodiment can further optimize the allocation of computing resources based on the task scheduling result by combining a resource management algorithm and a linear programming method, and obtain an optimized task scheduling result. The embodiment reduces resource waste while improving the resource utilization rate, and ensures that the optimal data acquisition effect is achieved under the limited resources.
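A greedy least-loaded assignment is one simple stand-in for the adaptive task scheduling described above (the full mechanism would also incorporate reinforcement learning and network conditions); device names and task costs below are illustrative:

```python
import heapq

def assign_tasks(tasks, devices):
    """Greedy load-aware scheduling: each acquisition task goes to the device
    with the smallest current load, and its cost is added to that load."""
    heap = [(load, name) for name, load in devices.items()]
    heapq.heapify(heap)
    plan = {name: [] for name in devices}
    # place the most expensive tasks first for a tighter balance
    for task, cost in sorted(tasks.items(), key=lambda t: -t[1]):
        load, name = heapq.heappop(heap)
        plan[name].append(task)
        heapq.heappush(heap, (load + cost, name))
    return plan

plan = assign_tasks(
    {"src_a": 5, "src_b": 3, "src_c": 2, "src_d": 1},
    {"dev1": 0.0, "dev2": 0.0},
)
```

Re-running the assignment as device loads change approximates the dynamic adjustment of the task allocation strategy; a learned policy could replace the greedy rule without changing the interface.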
S14, processing the original data, and determining an operation and maintenance result according to the processed data and the operation and maintenance service requirement information.
It should be appreciated that the purpose of data collection is to support various functions of intelligent operation and maintenance, such as fault detection, performance optimization and resource management. Therefore, after the data acquisition task is performed, this embodiment can transmit the acquired original data to the data processing module. The intelligent operation and maintenance platform can adopt efficient data transmission tools (such as Kafka and Flume) to ensure the stability and speed of data transmission. Meanwhile, the intelligent operation and maintenance platform supports data compression and breakpoint resumption, guaranteeing the integrity and reliability of data transmission. The data processing module in the intelligent operation and maintenance platform can process the original data in real time; illustratively, a stream processing framework (such as Apache Flink or Storm) is utilized for data cleaning, formatting and preliminary analysis, ensuring the immediate availability of the data. Still further exemplary, this embodiment may employ a data cleansing technique to remove invalid or erroneous data from the collected raw data, and then use a feature selection algorithm to pick out the most valuable information. Finally, potential problems or risk points are identified through an anomaly detection algorithm, providing accurate data support for operation and maintenance decisions.
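The cleansing-then-detection sequence can be sketched as below; the record fields, validity criteria, and the CPU-usage threshold standing in for the anomaly detection algorithm are all illustrative assumptions:

```python
def clean(records):
    """Remove invalid or erroneous records: drop entries with a missing host
    or a non-numeric / out-of-range metric (hypothetical validity criteria)."""
    return [r for r in records
            if r.get("host") and isinstance(r.get("cpu"), (int, float))
            and 0 <= r["cpu"] <= 100]

def detect_anomalies(records, threshold=90):
    """Flag potential risk points: a simple usage threshold stands in for
    the anomaly detection algorithm mentioned above."""
    return [r["host"] for r in records if r["cpu"] > threshold]

raw = [
    {"host": "web-1", "cpu": 95},
    {"host": "web-2", "cpu": 40},
    {"host": None, "cpu": 55},       # invalid: missing host
    {"host": "db-1", "cpu": "n/a"},  # erroneous: non-numeric metric
]
alerts = detect_anomalies(clean(raw))
```

In a deployment, the same two stages would sit inside the stream processing pipeline rather than run over an in-memory list.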
Alternatively, the present embodiment may store the processed data in a designated data warehouse or database. Specifically, the intelligent operation and maintenance platform supports various data storage modes, including relational databases (such as MySQL, postgreSQL), noSQL databases (such as MongoDB, cassandra), data warehouses (such as Hive and Amazon Redshift), and the like. The intelligent operation and maintenance platform can automatically select the most suitable storage mode according to the data characteristics and the application scene. The intelligent operation and maintenance platform can automatically create data indexes and partitions, and improves the performance of data query and analysis. Meanwhile, the intelligent operation and maintenance platform supports data life cycle management, automatically files and deletes expired data, and saves storage space.
According to this embodiment, the multi-level rule engine can flexibly set multi-level rules according to different operation and maintenance requirements and scenarios, which ensures the accuracy and applicability of the screening results. The target prediction model in this embodiment combines a support vector machine, a target clustering algorithm and a long short-term memory network, so that the optimal data acquisition mode can be predicted intelligently, improving the accuracy and flexibility of data acquisition. Specifically, this embodiment uses the support vector machine to perform classification analysis on the characteristics of the data sources, so that different types of data sources, such as structured data, unstructured data, static data and dynamic data, can be accurately distinguished.
In some possible embodiments, S12, predicting an optimal data acquisition mode corresponding to the data source set by using a target prediction model, wherein the target prediction model integrates a support vector machine, a target clustering algorithm and a long short-term memory network, comprises the following steps:
Step 121, analyzing the characteristics of each data source in the data source set by using a support vector machine and a target clustering algorithm based on the configuration file of each data source in the data source set to obtain an analysis result.
Step 122, predicting, according to the analysis result and in combination with the enterprise operation and maintenance personalized demand information and the historical data acquisition modes, the optimal data acquisition mode corresponding to the data source set by adopting a long short-term memory network.
This embodiment classifies the data sources into different groups (clusters) according to their characteristics (such as data size and access delay) through the target clustering algorithm, so as to facilitate finer-grained characteristic analysis. This embodiment utilizes the time sequence analysis capability of the long short-term memory network to predict the optimal data acquisition mode in combination with the historical data acquisition modes and the enterprise operation and maintenance personalized demands, thereby providing a personalized data acquisition mode for the demands of different data sources and application scenarios.
In the foregoing embodiment, as a possible implementation manner, step 121, analyzing, based on the configuration file of each data source in the data source set, the characteristics of each data source in the data source set by using a support vector machine and a target clustering algorithm to obtain an analysis result, comprises the following steps:
And a1, extracting first characteristic information of each data source in the data source set by adopting a metadata extraction technology based on configuration files of each data source in the data source set. The first characteristic information includes, but is not limited to, data type, data size, update frequency, network environment parameters, etc., and provides data support for subsequent characteristic analysis.
And a2, based on the first characteristic information of each data source in the data source set, classifying the class attribute of all the data sources in the data source set by adopting a support vector machine to obtain a data source classification result. The classification process is used to evaluate the similarities and differences among different data sources, and the data source classification result is used to distinguish structured data sources from unstructured data sources, static data sources from dynamic data sources, and the like.
And a3, based on the data source classification result, clustering all the data sources in each category by adopting a target clustering algorithm to obtain a data source clustering result. Target clustering algorithms include, but are not limited to, the K-means clustering algorithm. In the clustering process, the data sources can be classified into different groups according to factors such as data size, access delay and the like, so that finer characteristic analysis is facilitated.
Step a4, selecting part of the first characteristic information as second characteristic information by adopting principal component analysis based on the data source clustering result, and generating association relations between the data sources in each category by adopting an association rule learning algorithm based on the second characteristic information. It should be understood that principal component analysis eliminates redundant and irrelevant characteristic information while retaining the key characteristics that most influence the clustering result; it reduces the characteristic dimension while keeping the key characteristics of the data sources unchanged, so as to improve analysis efficiency and the performance of the support vector machine and the target clustering algorithm. The association rule learning algorithm can uncover implicit relationships among data sources, such as certain data sources being updated simultaneously under specific conditions, or certain types of data tending to occur in particular network environments; such findings facilitate a deeper understanding of the operating mode of the data sources.
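The dimension-reduction part of step a4 can be sketched via singular value decomposition. This is an illustrative numpy sketch; the characteristic columns (including a deliberately redundant copy of the size column) are hypothetical:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project a characteristic matrix onto its top principal components
    (step a4 sketch): redundant/correlated columns collapse into few axes."""
    Xc = X - X.mean(axis=0)              # center each characteristic column
    # SVD of the centered matrix: rows of Vt are principal directions,
    # singular values S come back in descending order.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)      # variance ratio per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

# Hypothetical first-characteristic matrix: rows = data sources, columns =
# [data size, update frequency, near-constant noise, redundant 2x size copy]
rng = np.random.default_rng(1)
size = rng.normal(0, 1, 50)
X = np.column_stack([size, rng.normal(0, 1, 50),
                     rng.normal(0, 0.01, 50), size * 2.0])
X2, ratio = pca_reduce(X, n_components=2)
```

Because the fourth column duplicates the first and the third is nearly constant, two components recover essentially all the variance, which is exactly the redundancy elimination step a4 describes.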
Step a5, optimizing the selection of the initial center points of the target clustering algorithm by adopting a genetic algorithm based on the association relations between the data sources, so as to obtain an optimized data source clustering result. Optimizing the selection of the initial centers of the K-means clustering process with a genetic algorithm enhances the clustering effect, ensuring high similarity among the data sources within each group and clear differences between groups.
Step a6, constructing a characteristic probability model of the data sources by adopting a Bayesian network based on the optimized data source clustering result, so as to form a data source behavior prediction model. The Bayesian network predicts the behavior mode of a data source under specific conditions by analyzing the conditional dependency relationships among the data sources, providing a basis for data source selection and priority ordering.
Step a7, generating an analysis result based on the configuration file of each data source, the first characteristic information, the data source classification result, the data source clustering result, the second characteristic information, the association relations among the data sources, the optimized data source clustering result, and the data source behavior prediction model. It should be appreciated that the analysis result represents the outcome of the characteristic analysis of the data sources. It includes not only quantitative characteristic results but also qualitative evaluations and suggestions, such as which data sources are most suitable for real-time processing and which are better suited to batch processing, providing comprehensive data support for decision makers.
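The idea behind the behavior prediction model of step a6 can be illustrated, in heavily simplified form, as a conditional probability table estimated from historical observations. A full Bayesian network would model dependencies among many variables; here a single condition/behavior pair, with entirely hypothetical labels and counts, stands in for it:

```python
from collections import Counter, defaultdict

def fit_cpt(observations):
    """Estimate P(behavior | condition) from (condition, behavior) pairs:
    a simplified stand-in for the Bayesian-network model of step a6."""
    counts = defaultdict(Counter)
    for condition, behavior in observations:
        counts[condition][behavior] += 1
    return {cond: {b: n / sum(c.values()) for b, n in c.items()}
            for cond, c in counts.items()}

def most_likely_behavior(cpt, condition):
    """Predicted behavior mode of a data source under a specific condition."""
    return max(cpt[condition], key=cpt[condition].get)

# Hypothetical history: (network load condition, observed update behavior)
history = [
    ("peak", "delayed_update"), ("peak", "delayed_update"), ("peak", "on_time"),
    ("off_peak", "on_time"), ("off_peak", "on_time"), ("off_peak", "delayed_update"),
]
cpt = fit_cpt(history)
```

Such condition-dependent probabilities are what gives the embodiment a basis for data source selection and priority ordering.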
Through the above refinement steps, the present embodiment not only deepens the understanding of data source characteristic analysis, but also ensures the scientific soundness and rationality of the analysis process by introducing multiple algorithms and technical means, providing more accurate characteristic analysis results of the data sources.
In this embodiment, the target clustering algorithm includes the K-means clustering algorithm. Correspondingly, step a3, in which all the data sources in each category are clustered by adopting the target clustering algorithm based on the data source classification result to obtain the data source clustering result, includes the following steps:
Step a31, defining a characteristic quantization index system of the data sources, wherein the characteristic quantization index system includes the following characteristic quantization indexes: data size, access delay, update frequency, and security. This ensures that each characteristic quantization index accurately reflects the characteristics of the data sources and provides accurate data input for the subsequent cluster analysis.
Step a32, performing standard quantization processing on the first characteristic information of all the data sources in each category by adopting a standardization technique based on the data source classification result and the characteristic quantization index system, so as to obtain the characteristic values of all the data sources in each category. Standardization eliminates the deviation caused by the differing dimensions of the first characteristic information, ensuring that the K-means clustering algorithm weighs each characteristic quantization index fairly and improving the accuracy of the clustering result.
Step a33, introducing a hierarchical clustering algorithm to determine the number of groups of the data sources in each category, and taking the number of groups as the value of the initial input parameter of the K-means clustering algorithm. Specifically, the hierarchical clustering algorithm serves as a preprocessing step for the K-means clustering algorithm: the number of groups of the data sources is first preliminarily determined through hierarchical clustering and then used as the input parameter K of the K-means clustering algorithm, thereby solving the problem that the K-means clustering algorithm requires a pre-assigned K value and improving the effectiveness and rationality of the clustering.
Step a34, selecting initial center points in a probability-weighted manner, executing the K-means clustering process based on the value of the initial input parameter and the initial center points, and introducing a dynamic adjustment mechanism for the distance metric during clustering so as to obtain the data source clustering result, wherein the dynamic adjustment mechanism indicates that different distance metric standards are adopted for data sources of different characteristic types. It should be appreciated that selecting the initial center points in a probability-weighted manner reduces the bias caused by purely random selection and improves the quality and stability of the clustering. The dynamic adjustment mechanism means that the distance metric standard is dynamically adjusted according to the actual differences among data source characteristics; for example, the Euclidean distance may be adopted for high-dimensional characteristics and the dynamic time warping distance for time-series characteristics, thereby meeting the requirements of different types of data source characteristics and further improving clustering precision.
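The hierarchical pre-pass of step a33 and the probability-weighted initialization of step a34 can be sketched together. This is an illustrative numpy sketch: the naive centroid-linkage merge, the `merge_threshold` value, and the feature vectors are all hypothetical choices, and the weighted initialization follows the well-known k-means++ scheme as one concrete instance of "probability-weighted" selection:

```python
import numpy as np

def estimate_k(features, merge_threshold):
    """Step a33 sketch: merge clusters (centroid linkage) until the closest
    pair is farther apart than merge_threshold; survivors give K."""
    clusters = [features[i:i + 1] for i in range(len(features))]
    while len(clusters) > 1:
        cents = np.array([c.mean(axis=0) for c in clusters])
        d = np.linalg.norm(cents[:, None] - cents[None, :], axis=2)
        d[np.diag_indices_from(d)] = np.inf
        i, j = np.unravel_index(d.argmin(), d.shape)
        if d[i, j] > merge_threshold:
            break
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return len(clusters)

def weighted_init(features, k, seed=0):
    """Step a34 sketch: each next center is drawn with probability
    proportional to squared distance from the nearest center so far."""
    rng = np.random.default_rng(seed)
    centers = [features[rng.integers(len(features))]]
    while len(centers) < k:
        d2 = np.min([np.sum((features - c) ** 2, axis=1) for c in centers],
                    axis=0)
        centers.append(features[rng.choice(len(features), p=d2 / d2.sum())])
    return np.array(centers)

# Hypothetical standardized feature vectors for six data sources
feats = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0],
                  [10.0, 0.0], [10.1, 0.1]])
k = estimate_k(feats, merge_threshold=3.0)
centers = weighted_init(feats, k)
```

Already-chosen points have zero squared distance and thus zero probability, so the k initial centers are guaranteed to be distinct and well spread, which is the bias reduction step a34 aims for.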
Specifically, in step a34, introducing the dynamic adjustment mechanism for the distance metric during clustering to obtain the data source clustering result, wherein the dynamic adjustment mechanism indicates that different distance metric standards are adopted for data sources of different characteristic types, includes the following steps:
Step a341, constructing a distance metric library comprising multiple distance metric standards, including the Euclidean distance, the Manhattan distance, the cosine similarity distance, and the dynamic time warping distance.
Step a342, classifying all the characteristic quantization indexes in the characteristic quantization index system by adopting a feature engineering method to obtain an initial characteristic classification result, and selecting, from the distance metric library, the distance metric standard corresponding to each classification in the initial characteristic classification result by adopting an adaptive distance metric selection algorithm. By introducing multiple distance metric standards and designing an adaptive selection algorithm, the most suitable distance metric standard can be chosen dynamically according to the actual condition of the data source characteristics during clustering, improving the accuracy and reliability of the data source clustering result.
Step a343, assigning a weight to each characteristic quantization index in each classification according to its characteristic importance, and dynamically adjusting the distance metric standard corresponding to each classification according to these weights, so as to obtain the adjusted distance metric standard corresponding to each classification.
Step a344, determining the data source clustering result based on the adjusted distance metric standards corresponding to all the classifications.
By introducing the dynamic adjustment mechanism for the distance metric, the present embodiment can dynamically select the most suitable distance metric standard according to the actual condition of the data source characteristics during clustering. Specifically, the mechanism constructs a distance metric library (comprising the Euclidean distance, the Manhattan distance, the cosine similarity distance, and the dynamic time warping distance), classifies the characteristic quantization indexes by a feature engineering method, and selects the most suitable metric from the library through the adaptive distance metric selection algorithm. In addition, each characteristic quantization index within each classification is weighted according to its importance, and the corresponding distance metric standard is dynamically adjusted, ensuring the accuracy and reliability of the data source clustering result. This not only improves the accuracy of data source clustering but also enhances the robustness and generalization capability of the model, so that the clustering result better matches the characteristics of actual data and provides a reliable basis for subsequent data analysis and application.
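The metric library of steps a341-a343 can be sketched as follows. The metrics themselves are standard; the rule-based mapping in `select_metric` is only a hypothetical stand-in for the adaptive selection algorithm, whose actual decision logic the embodiment leaves open:

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def manhattan(a, b):
    return float(np.sum(np.abs(a - b)))

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dtw(a, b):
    """Dynamic time warping distance between two 1-D sequences of
    possibly different lengths (classic O(n*m) recurrence)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Step a341: the distance metric library.
METRIC_LIBRARY = {"euclidean": euclidean, "manhattan": manhattan,
                  "cosine": cosine_distance, "dtw": dtw}

def select_metric(feature_class):
    # Step a342 stand-in: illustrative mapping from characteristic
    # classification to metric; a real selection algorithm would be learned.
    mapping = {"high_dimensional": "euclidean", "sparse": "cosine",
               "time_series": "dtw"}
    return METRIC_LIBRARY[mapping.get(feature_class, "manhattan")]

metric = select_metric("time_series")
d = metric(np.array([0.0, 1.0, 2.0, 3.0]), np.array([0.0, 0.0, 1.0, 2.0, 3.0]))
```

Note how DTW judges the two time-shifted update-frequency sequences identical (distance 0), whereas a fixed Euclidean metric could not even compare sequences of different lengths; this is the motivation for metric selection by characteristic type.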
Step a35, evaluating the data source clustering result by using the silhouette coefficient method to obtain a silhouette coefficient value; when the silhouette coefficient value is smaller than a preset threshold, adjusting the value of the initial input parameter or optimizing the distance metric standard, and repeatedly executing the K-means clustering process and the evaluation operation until the silhouette coefficient value is greater than or equal to the preset threshold.
It should be understood that after clustering is completed, the data source clustering result is evaluated with the silhouette coefficient method, which measures clustering quality by comparing, for each data source, its distance to its own cluster against its distances to the other clusters. If the silhouette coefficient is low, the K value is adjusted or the distance metric standard is optimized, and the clustering process is re-executed until a satisfactory data source clustering result is obtained.
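The evaluation of step a35 can be sketched as follows; this is an illustrative numpy implementation of the standard silhouette coefficient, applied to hypothetical feature vectors and two contrasting label assignments:

```python
import numpy as np

def silhouette(features, labels):
    """Mean silhouette coefficient (step a35 sketch): for each point,
    a = mean distance to its own cluster, b = lowest mean distance to
    any other cluster, s = (b - a) / max(a, b)."""
    d = np.linalg.norm(features[:, None] - features[None, :], axis=2)
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li)
        same[i] = False                 # exclude the point itself
        if not same.any():              # singleton cluster: define s = 0
            scores.append(0.0)
            continue
        a = d[i][same].mean()
        b = min(d[i][labels == lj].mean() for lj in set(labels) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical clustering results: a good assignment vs. a poor one.
# If the score falls below the preset threshold, step a35 adjusts K or
# the distance metric and re-runs the clustering.
feats = np.array([[0.0, 0.0], [0.2, 0.1], [8.0, 8.0], [8.1, 7.9]])
good = np.array([0, 0, 1, 1])
bad = np.array([0, 1, 0, 1])
s_good, s_bad = silhouette(feats, good), silhouette(feats, bad)
```

The well-separated assignment scores near 1 while the mixed assignment scores negative, which is exactly the signal the retry loop of step a35 keys on.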
Optionally, based on the optimized distance metric standard, the stability and robustness of the data source clustering result are tested by means of multiple runs, noise testing, and parameter sensitivity analysis, so as to obtain a stable and robust data source clustering result.
Through the above steps, the application of the standardization technique, the hierarchical clustering algorithm, the K-means clustering algorithm, and the dynamic adjustment mechanism for the distance metric improves the applicability and accuracy of the K-means clustering algorithm while ensuring the rationality of the cluster analysis process.
Specifically, step a5, in which the selection of the initial center points of the target clustering algorithm is optimized by adopting a genetic algorithm based on the association relations between the data sources to obtain an optimized data source clustering result, includes the following steps:
Step a51, initializing a population to obtain an initial population, wherein each individual in the initial population represents one candidate set of initial center-point positions for a group of data sources. The population size may be determined based on the number of data sources in the category and the characteristic dimension.
Step a52, calculating the fitness value of each individual in the population based on a fitness function, wherein the population is the initial population in the first iteration and the updated population in subsequent iterations. The fitness function may be based on the compactness and separation of the clusters, and the individual with the highest fitness is selected as the current optimal solution.
Step a53, performing a selection operation based on the fitness value of each individual to obtain an optimized population, and performing crossover and mutation operations on the individuals in the optimized population to obtain an updated population. Specifically, the crossover operation of the genetic algorithm generates new individuals by selecting two individuals with higher fitness and exchanging genes between them, which increases population diversity and yields the crossed population; based on the crossed population, the mutation operation randomly alters some genes of certain individuals, which prevents premature convergence and maintains the exploratory capability of the population, yielding the updated population.
Step a54, judging whether a preset iteration stopping condition is reached; if not, re-executing the calculation step, the selection operation, the crossover operation, the mutation operation, and the judging step in steps a52 to a54 until the preset iteration stopping condition is reached, wherein the preset iteration stopping condition is that the maximum number of iterations is reached or the genetic algorithm converges.
Correspondingly, optimizing the selection of the initial center points of the target clustering algorithm through the genetic algorithm improves the quality and stability of the data source clustering result. Specifically, in this embodiment, a population is initialized, the fitness value of each individual is calculated based on the fitness function, and the individual with the highest fitness is selected as the current optimal solution. The population is then continuously refined through selection, crossover, and mutation operations, increasing its diversity and exploratory capability and preventing premature convergence. This embodiment not only improves the rationality of the initial center-point selection but also enhances the robustness and accuracy of the clustering algorithm, finally yielding an optimized data source clustering result that is more stable and reliable, better reflects the actual association relations between the data sources, and lays a solid foundation for subsequent data analysis and application.
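Steps a51 to a54 can be sketched as follows. This is an illustrative numpy sketch: the fitness function (negative total squared distance to the nearest center, as a compactness proxy), the elitist selection, the uniform per-center crossover, the Gaussian jitter mutation, and all hyperparameter values are example choices, not the embodiment's prescribed ones:

```python
import numpy as np

def cluster_cost(features, centers):
    # Total squared distance of every data source to its nearest center
    # (lower cost = more compact clusters).
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

def ga_init_centers(features, k, pop_size=20, gens=40, mut_rate=0.2, seed=0):
    """Steps a51-a54 sketch: evolve candidate sets of K initial centers."""
    rng = np.random.default_rng(seed)
    dim = features.shape[1]
    # Step a51: each individual = one candidate set of k center positions,
    # seeded from actual data-source feature vectors.
    pop = features[rng.integers(len(features), size=(pop_size, k))]
    for _ in range(gens):
        # Step a52: fitness = negative clustering cost.
        fitness = np.array([-cluster_cost(features, ind) for ind in pop])
        order = np.argsort(fitness)[::-1]
        parents = pop[order[: pop_size // 2]]      # step a53: elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            i, j = rng.integers(len(parents), size=2)
            mask = rng.random(k) < 0.5             # uniform crossover per center
            child = np.where(mask[:, None], parents[i], parents[j])
            if rng.random() < mut_rate:            # mutation: jitter one center
                child[rng.integers(k)] += rng.normal(0.0, 0.5, dim)
            children.append(child)
        # Step a54: next generation (loop bound plays the stop condition).
        pop = np.concatenate([parents, np.array(children)])
    fitness = np.array([-cluster_cost(features, ind) for ind in pop])
    return pop[fitness.argmax()]

# Hypothetical feature vectors for four data sources forming two groups
feats = np.array([[0.0, 0.0], [0.1, 0.2], [6.0, 6.0], [6.1, 5.9]])
best_centers = ga_init_centers(feats, k=2)
```

Because the top half of each generation survives unchanged, the best candidate is never lost to mutation, and the returned centers land near the two true groups rather than in between them.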
In the foregoing embodiment, as a possible implementation manner, step 122, in which the optimal data acquisition mode corresponding to the data source set is predicted by a long short-term memory network according to the analysis result, in combination with the personalized requirement information of enterprise operation and maintenance and the historical data acquisition modes, includes:
Step b1, generating a comprehensive demand feature matrix based on the analysis result in combination with the personalized requirement information of enterprise operation and maintenance.
Step b2, generating time-series features of the historical acquisition modes by adopting a time-series analysis method based on the comprehensive demand feature matrix.
Step b3, predicting the optimal data acquisition mode corresponding to the data source set by using a long short-term memory network based on the comprehensive demand feature matrix and the time-series features of the historical acquisition modes.
Generating the comprehensive demand feature matrix from the analysis result and the personalized requirement information of enterprise operation and maintenance allows the model to fully consider the specific demands and actual conditions of the enterprise, improving the pertinence and practicality of the prediction. Generating time-series features of the historical acquisition modes through time-series analysis makes full use of the time dependence and change trends in the historical data, strengthening the predictive capability of the model. The long short-term memory network improves the accuracy of the predicted data acquisition mode and provides more efficient and reliable operation and maintenance support for the enterprise.
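The LSTM of steps b1-b3 consumes a sequence of demand feature vectors and emits a hidden state that a classifier head can score against candidate acquisition modes. The single-layer forward pass below is a shape-level numpy sketch only: the weights are random and untrained, and the dimensions (3 features, hidden size 5, 3 candidate modes) are hypothetical; a deployed system would use a trained framework model:

```python
import numpy as np

def lstm_forward(x_seq, Wx, Wh, b, h0, c0):
    """Single-layer LSTM forward pass over a sequence (steps b1-b3 sketch).
    Gate order in the stacked weights: input, forget, cell, output."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    h, c = h0, c0
    H = h0.shape[0]
    for x in x_seq:
        z = Wx @ x + Wh @ h + b          # (4H,) pre-activations for all gates
        i = sigmoid(z[:H])               # input gate
        f = sigmoid(z[H:2 * H])          # forget gate
        g = np.tanh(z[2 * H:3 * H])      # candidate cell state
        o = sigmoid(z[3 * H:])           # output gate
        c = f * c + i * g                # carry long-term memory forward
        h = o * np.tanh(c)               # expose short-term state
    return h                             # final hidden state for the head

# Hypothetical setup: 4 timesteps of 3-dim comprehensive demand features,
# hidden size 5; all weights random (untrained), for illustration only.
rng = np.random.default_rng(0)
D, H, T = 3, 5, 4
x_seq = rng.normal(size=(T, D))
h = lstm_forward(x_seq, rng.normal(size=(4 * H, D)),
                 rng.normal(size=(4 * H, H)), np.zeros(4 * H),
                 np.zeros(H), np.zeros(H))
# A softmax head over h scores three candidate acquisition modes.
scores = np.exp(h @ rng.normal(size=(H, 3)))
probs = scores / scores.sum()
```

The gated cell state is what lets the model carry long-range trends in the historical acquisition modes forward, which plain feed-forward models cannot do.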
In summary, the present embodiment has the following advantages:
(1) Most existing data acquisition methods rely on manual configuration and manual selection of the acquisition mode, lacking an intelligent automatic selection mechanism. The intelligent operation and maintenance platform in this embodiment automatically selects the optimal data acquisition mode according to the characteristics of the data sources (such as data type, data volume, and network environment) through automated tooling, the support vector machine, and the K-means clustering algorithm, in combination with the personalized requirement information of enterprise operation and maintenance and the historical data acquisition modes, thereby reducing manual intervention and improving acquisition efficiency and reliability.
(2) Existing data acquisition methods generally adopt serial processing on a single device or across multiple devices, and cannot fully utilize the computing resources of multiple devices, so the data acquisition speed is limited. The intelligent operation and maintenance platform in this embodiment uses the distributed computing framework to distribute the data acquisition tasks to multiple devices for parallel execution, greatly improving the speed and concurrency of data acquisition. This distributed processing mode can not only handle large-scale data but also ensure the real-time performance and high efficiency of data acquisition.
(3) Existing data acquisition methods often lack an intelligent recommendation mechanism, requiring users to manually configure and adjust acquisition parameters, which is error-prone and inefficient. The intelligent operation and maintenance platform in this embodiment has a built-in multi-level rule engine, and a dynamic rule layer in the engine can use a machine learning algorithm to intelligently recommend the most suitable screening rules according to historical data and user behavior data, further improving the accuracy and efficiency of data acquisition.
Fig. 2 is a schematic structural diagram of a data acquisition system based on big data of an intelligent operation and maintenance platform according to an embodiment of the present application, as shown in fig. 2, the system includes:
The acquiring and filtering module 21 is configured to acquire configuration files of a plurality of data sources connected to the intelligent operation and maintenance platform, and screen a target data source from the plurality of data sources by using a multi-level rule engine based on the configuration files of the plurality of data sources, so as to form a data source set.
The prediction module 22 is configured to predict an optimal data acquisition mode corresponding to the data source set by using a target prediction model, where the target prediction model is introduced with a support vector machine, a target clustering algorithm and a long-term and short-term memory network.
The classification acquisition module 23 is configured to combine with the distributed computing framework of the intelligent operation and maintenance platform, and distribute the data acquisition task of the data source set to multiple devices in the intelligent operation and maintenance platform according to an optimal data acquisition mode, so that the multiple devices can acquire the original data acquired by all the data sources in the data source set in parallel.
The processing determining module 24 is configured to process the raw data, and determine an operation and maintenance result according to the processed data and the operation and maintenance service requirement information.
The data collection system based on the big data of the intelligent operation and maintenance platform described in fig. 2 may execute the data collection method based on the big data of the intelligent operation and maintenance platform described in the embodiment shown in fig. 1, and its implementation principle and technical effects are not described again. The specific manner in which the modules and units perform the operations in the data acquisition system based on big data of the intelligent operation and maintenance platform in the foregoing embodiments has been described in detail in the embodiments related to the method, and will not be described in detail herein.
In one possible design, the smart operation and maintenance platform big data based data acquisition system of the embodiment shown in fig. 2 may be implemented as a computing device, which may include a storage component 31 and a processing component 32, as shown in fig. 3.
The storage component 31 stores one or more computer instructions for execution by the processing component 32.
The processing component 32 is configured to: acquire configuration files of a plurality of data sources connected to the intelligent operation and maintenance platform; screen target data sources from the plurality of data sources by using a multi-level rule engine based on the configuration files of the plurality of data sources to form a data source set; predict an optimal data acquisition mode corresponding to the data source set by using a target prediction model, wherein the target prediction model is introduced with a support vector machine, a target clustering algorithm, and a long short-term memory network; in combination with the distributed computing framework of the intelligent operation and maintenance platform, distribute the data acquisition tasks of the data source set to a plurality of devices in the intelligent operation and maintenance platform according to the optimal data acquisition mode, so that the plurality of devices acquire in parallel the original data collected by all the data sources in the data source set; process the original data; and determine an operation and maintenance result according to the processed data and the operation and maintenance service requirement information.
Wherein the processing component 32 may include one or more processors to execute computer instructions to perform all or part of the steps of the methods described above. Of course, the processing component may also be implemented as one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the above method.
The storage component 31 is configured to store various types of data to support operations at the terminal. The storage component may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as random access memory (RAM), static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
Of course, the computing device may also include other components, such as input/output interfaces, display components, and communication components.
The input/output interface provides an interface between the processing component and a peripheral interface module, which may be an output device, an input device, etc.
The communication component is configured to facilitate wired or wireless communication between the computing device and other devices, and the like.
The computing device may be a physical device or an elastic computing host provided by the cloud computing platform, and at this time, the computing device may be a cloud server, and the processing component, the storage component, and the like may be a base server resource rented or purchased from the cloud computing platform.
The embodiment of the application also provides a computer storage medium which stores a computer program, and the computer program can realize the data acquisition method based on the big data of the intelligent operation and maintenance platform in the embodiment shown in the figure 1 when being executed by a computer.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the above-described system and unit may refer to the corresponding procedures in the foregoing method embodiments, which are not repeated here.
The system embodiments described above are merely illustrative; the components illustrated as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same, and although the present application has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present application.