
CN119513503A - A data conversion method, device, medium and equipment based on machine learning - Google Patents

A data conversion method, device, medium and equipment based on machine learning

Info

Publication number
CN119513503A
CN119513503A (application CN202510080610.5A)
Authority
CN
China
Prior art keywords
data
rule
model
real
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510080610.5A
Other languages
Chinese (zh)
Inventor
蒋海峰
李严
袁涛
李亚楠
方亮
周群
王宁宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Communications Information Technology Group Co ltd
Original Assignee
China Communications Information Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Communications Information Technology Group Co ltd filed Critical China Communications Information Technology Group Co ltd
Priority to CN202510080610.5A
Publication of CN119513503A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/10Pre-processing; Data cleansing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0475Generative networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • G06N5/025Extracting rules from data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/72Data preparation, e.g. statistical preprocessing of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/771Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0985Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a data conversion method, device, medium and equipment based on machine learning. The method comprises: obtaining original data and preprocessing it; performing feature selection on the preprocessed data based on a feature selection algorithm; inputting the selected data into a trained rule generation model corresponding to its data type to generate data conversion rules, the rule generation models being separately constructed and trained for different data types; obtaining a real-time data stream and optimizing the rules based on a trained rule optimization model and/or adaptively adjusting the rules according to real-time monitoring data; and processing the data with the adjusted rules and performing anomaly detection and correction on the rule-processed data. By generating rules automatically through real-time data pattern recognition and performing adaptive rule optimization together with comprehensive data detection and correction, the method significantly improves the efficiency and accuracy of data processing.

Description

Data conversion method, device, medium and equipment based on machine learning
Technical Field
The invention relates to the technical field of data processing and analysis, in particular to a data conversion method, device, medium and equipment based on machine learning.
Background
With the rapid development of information technology, data is becoming an important asset for modern enterprises and organizations. However, these data often originate from different systems, are not in the same format and are of varying quality, which poses a significant challenge to the efficient use of the data. Traditional data processing methods rely on manually defined rules, which are not only inefficient but also difficult to adapt to rapid changes in the data, easily leading to errors and inconsistencies in data processing.
The prior art mainly has the following problems:
1. Traditional methods rely on manually defined data conversion rules, which are difficult to maintain and update;
2. Data quality is poor: noise, missing values and abnormal values often exist in the data and affect its accuracy;
3. The prior art has limited capability in data pattern recognition and cannot effectively mine the potential value in the data;
4. Real-time performance is poor: data cannot be processed in real time, so information lags.
Disclosure of Invention
Against this background, the invention aims to provide a data conversion method based on machine learning, applicable to scenarios such as big data processing, data cleaning and enhancement, pattern recognition and multi-source data fusion, with the aim of overcoming the limitations of traditional data processing methods and improving the efficiency and accuracy of data conversion.
In order to achieve the above purpose, the invention adopts the following technical scheme:
In a first aspect, the present invention provides a machine learning based data conversion method, comprising,
Acquiring original data and preprocessing the original data;
Performing feature selection on the preprocessed data based on a feature selection algorithm;
Inputting the data after feature selection into a corresponding trained rule generation model based on the data type to generate a corresponding data conversion rule, wherein the rule generation model is respectively constructed and trained according to the data type and comprises an image data rule generation model and a text data rule generation model;
Acquiring a real-time data stream, optimizing rules based on a trained rule optimization model, and/or adaptively adjusting the rules by defining a rule adjustment function according to real-time monitoring data;
and processing the data through the adjusted rules, and performing anomaly detection and correction on the rule-processed data.
In a second aspect, the present invention provides a machine learning based data conversion apparatus comprising,
The original data acquisition module is used for acquiring original data and preprocessing the original data;
the feature selection module is used for carrying out feature selection on the data through a feature selection algorithm and extracting a feature subset from the original data;
the rule generation module is used for acquiring the data after feature selection and generating a corresponding data conversion rule;
The rule optimization module is used for defining a reward function based on a reinforcement learning algorithm to optimize the rule and adaptively adjusting the rule according to real-time monitoring data;
the data monitoring and correcting module is used for acquiring the data after the rule processing and carrying out abnormality detection and correction on the data after the rule processing.
In a third aspect, the present invention further provides a non-transitory computer readable storage medium, where at least one instruction or at least one program is stored, where the at least one instruction or the at least one program is loaded and executed by a processor to implement the machine learning based data conversion method described above.
In a fourth aspect, the invention also provides an electronic device comprising a processor and the non-transitory computer readable storage medium described above.
The data conversion method based on machine learning provided by the invention has the following beneficial effects:
1. Real-time data pattern recognition and efficient automatic rule generation: relying on advanced deep learning models, potential patterns and trends can be continuously monitored and recognized in dynamic data streams for real-time analysis, ensuring the timeliness and accuracy of data processing; data conversion rules adapted to different data types and application scenarios are created quickly and automatically, so that data from different industries and fields can be handled flexibly.
2. Adaptive rule optimization mechanism: changes in data characteristics are monitored automatically, and data processing rules are adjusted and optimized in time, improving the applicability and accuracy of the rules; through continuous learning and feedback the system can improve itself and maintain efficient processing capability under new data conditions.
3. Comprehensive anomaly detection and automatic correction: errors are detected in the data stream in time and corrected effectively and automatically, ensuring data quality. Specifically, by applying advanced technologies such as natural language processing and generative adversarial networks, the data cleaning process can automatically remove noise, fill missing values and generate synthetic data, enriching the diversity of data sets and improving data quality.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a machine learning based data conversion method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a machine learning-based data conversion device according to an embodiment of the present invention.
Detailed Description
For a further understanding of the present invention, preferred embodiments of the invention are described below in conjunction with the examples, but it should be understood that these descriptions are merely intended to illustrate further features and advantages of the invention, and are not limiting of the claims of the invention.
Example 1
Referring to fig. 1, the data conversion method based on machine learning provided by the embodiment of the invention includes the following steps:
S1, acquiring original data, and preprocessing the original data;
S2, carrying out feature selection on the preprocessed data based on a feature selection algorithm;
S3, inputting the data with the characteristics selected into a rule generation model trained by the data of the corresponding type based on the data type to generate a corresponding data conversion rule, wherein the rule generation model is respectively constructed and trained according to different data types;
S4, acquiring a real-time data stream, optimizing rules based on a trained rule optimization model, and/or adaptively adjusting the rules according to real-time monitoring data;
S5, processing the data through the adjusted rules, and performing anomaly detection and correction on the rule-processed data.
Wherein, the step S1 of acquiring data and preprocessing the data specifically comprises,
S101, acquiring original data from a data source;
The data types of the data acquired from the data source include structured data and unstructured data. Structured data refers to data that can be organized in a predefined manner, including database and CSV (Comma-Separated Values) file data; unstructured data refers to data that does not follow a predefined format, including text data and image data.
S102, preprocessing the acquired original data, including data source identification, data metadata extraction, data cleaning, data format conversion and the like,
The data source identification is used for determining the source of data, including the time, place, mode and the like of data collection;
metadata extraction is the extraction of additional information about the original data, such as features, attributes, etc. of the data to aid in understanding and using the data;
Data cleaning, including noise removal, outlier identification, repeated data removal and the like, so as to improve data quality and ensure accuracy and effectiveness of subsequent data conversion;
Format conversion is the conversion of data from one data format to another, such as from a text format to a tabular format, or numerical data to a specific statistical data format. In practice, the data is converted to a data type suitable for analysis processing, selected based on specific requirements.
In step S2, feature selection is performed on the preprocessed data by a feature selection algorithm, so that the most relevant and most informative feature subset is extracted from the original data, reducing the computational cost and helping to understand the data.
The automatic feature selection algorithm can identify important feature variables and reduce the number of features, thereby avoiding model overfitting. Common feature selection algorithms include filter methods (Filter Methods), wrapper methods (Wrapper Methods) and embedded methods (Embedded Methods). Filter methods select features by evaluating the relationship between the features and the target variable; common methods include the variance selection method, the chi-square test and the Pearson correlation coefficient. Wrapper methods evaluate the performance of a feature subset by building a model, usually selecting the best feature set with cross-validation; a common method is Recursive Feature Elimination (RFE). Embedded methods embed the feature selection process into model training; common algorithms include LASSO and decision trees. The choice of feature selection algorithm depends on the specific application scenario, the characteristics of the data set and the model used. In practical applications, the most appropriate feature selection strategy needs to be determined by combining the specific application scenario with domain knowledge. Feature selection not only improves the training efficiency and predictive performance of the model, but also helps in understanding the patterns and relationships behind the data.
In this embodiment, to improve the automation degree of the feature engineering, some existing libraries, such as FeatureTools and TPOT, can be used to automatically complete the feature selection process. FeatureTools is an open source library that allows the user to automatically generate new features through a simple API.
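As a concrete illustration of the three families of methods above, the following is a minimal scikit-learn sketch (an implementation assumption; the embodiment may equally rely on FeatureTools or TPOT), showing a variance filter, RFE as a wrapper method and LASSO as an embedded selector. The synthetic dataset and thresholds are placeholders.

```python
# Minimal feature-selection sketch using scikit-learn (an assumption; the
# embodiment may instead automate this step with FeatureTools or TPOT).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression, Lasso

X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=0)

# Filter method: drop near-constant features by variance.
X_filter = VarianceThreshold(threshold=0.1).fit_transform(X)

# Wrapper method: recursive feature elimination around a simple estimator.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
X_wrapper = X[:, rfe.support_]

# Embedded method: LASSO shrinks irrelevant coefficients to zero.
lasso = Lasso(alpha=0.01).fit(X, y)
X_embedded = SelectFromModel(lasso, prefit=True).transform(X)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```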
In step S3, the data after feature selection is input into a rule generation model trained on data of the corresponding type, according to the data type. Rule generation identifies patterns and trends from a large amount of data and discovers useful information in the data set, including but not limited to classification, clustering, anomaly detection and prediction, and generates the corresponding data conversion rules. The generated data conversion rules may help in understanding the intrinsic rules in the data or be used directly in decision support systems, automated processes and the like.
It should be noted that, the rule generating model is respectively constructed and trained according to different data types, and different types of data correspond to different rule generating models.
For structured data, the rule generation model can be constructed based on machine learning algorithms such as decision trees and random forests; these algorithms learn the relationship between data features and target variables from training data and thus generate the corresponding rules. That is, by analysing the data set, effective conversion rules are extracted, so that data from different industries and fields can be handled flexibly. The decision tree algorithm is a simple and effective classification and regression model; the data can be divided into different categories through a series of rules, and the generated rules are easy to understand and can effectively handle nonlinear relationships. Common decision tree algorithms include ID3, C4.5 and CART. The decision tree construction process mainly comprises: (1) selecting features, i.e. selecting the optimal splitting feature using indexes such as information gain, gain ratio or the Gini index; (2) creating nodes, i.e. splitting the nodes according to the selected feature to form a tree structure; and (3) recursively splitting each subset, i.e. repeating the process until a stopping condition is met (for example, the maximum depth is reached or the number of samples at a node falls below a threshold). The random forest algorithm is an ensemble learning method based on decision trees; by constructing multiple decision trees and combining their results, the accuracy and stability of the model are improved, effectively reducing overfitting and improving the generalization ability of the model on new data. The construction process of a random forest comprises: (1) Bootstrap sampling, i.e. randomly drawing multiple samples from the original data set to form multiple different training sets; (2) constructing decision trees, i.e. building one decision tree for each training set, with features randomly selected for splitting; and (3) a voting mechanism, i.e. the outputs of all trees are voted on to determine the final classification result or regression value.
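A hedged sketch of this structured-data branch, assuming scikit-learn: a decision tree is trained and its branches exported as readable if/then conversion rules, and a random forest combines bootstrap-sampled trees by voting. The dataset and feature names are placeholders.

```python
# Hedged sketch of rule generation for structured data: a decision tree is
# trained and its branches are exported as human-readable conversion rules.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Single decision tree: the exported text is a direct if/then rule set.
tree = DecisionTreeClassifier(max_depth=3, criterion="gini").fit(X, y)
print(export_text(tree, feature_names=["sepal_len", "sepal_wid", "petal_len", "petal_wid"]))

# Random forest: Bootstrap sampling plus random feature splits, final class by voting.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0).fit(X, y)
print("forest prediction:", forest.predict(X[:1]))
```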
For text type data, since a Recurrent Neural Network (RNN) can efficiently process sequence data, model context, and characteristics of dynamic input length, a rule generation model of text data type can be constructed using the network structure. And the process of constructing a text data rule generation model based on a Recurrent Neural Network (RNN) includes,
1. Data preparation
Collecting text data meeting the requirements, and ensuring the diversity and representativeness of a data set;
data preprocessing, namely cleaning a text (removing impurities such as punctuation, special symbols and the like), segmenting words, extracting stems and the like;
encoding, converting the text data into numerical form; common methods include the bag-of-words model, TF-IDF, Word2Vec and GloVe. For RNNs, words are typically converted into dense vector representations.
2. Building a model
Selecting the recurrent neural network (RNN) structure: a basic RNN, a long short-term memory network (LSTM) or gated recurrent units (GRU) can be chosen; LSTM and GRU are generally more effective at capturing long-range dependencies;
defining inputs and outputs, the inputs are typically a sequence (e.g., a sentence or a piece of text) and the outputs may be predictions of the next word or generation of the entire sentence.
3. Model training
Selecting a proper loss function to measure the difference between the model prediction and the true value, for example the cross-entropy loss function, and updating the model parameters in combination with an optimization algorithm (such as Adam or SGD). The prepared dataset is divided into a training set and a validation set, the RNN model is trained over multiple iterations, and the loss function is monitored to avoid overfitting.
4. Model evaluation
Common metrics include accuracy, F1-score and perplexity; the validation set is used to evaluate the performance of the model. According to the evaluation result, the hyperparameters of the model (such as the learning rate, hidden layer size and batch size) are adjusted to improve its performance.
Through the above steps, a rule generation model capable of extracting features from text data and classifying or identifying them can be constructed, and rules for the text data are generated using different generation strategies such as greedy search, beam search or temperature sampling.
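A minimal sketch of such a text rule generation model, assuming Keras/TensorFlow (the framework is not prescribed by the embodiment); the example texts, labels and rule categories are illustrative placeholders.

```python
# Minimal LSTM text-classification sketch: texts are tokenised, embedded into
# dense vectors, and an LSTM predicts a rule category for each sequence.
import numpy as np
import tensorflow as tf

texts = ["clean the address field", "merge duplicate customer rows"]  # illustrative only
labels = np.array([0, 1])

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=5000)
tokenizer.fit_on_texts(texts)
seqs = tf.keras.preprocessing.sequence.pad_sequences(
    tokenizer.texts_to_sequences(texts), maxlen=20)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=64),   # dense word vectors
    tf.keras.layers.LSTM(64),                                    # captures long-range context
    tf.keras.layers.Dense(2, activation="softmax"),              # rule-category scores
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(seqs, labels, epochs=3, verbose=0)
```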
For image type data, since Convolutional Neural Networks (CNNs) are suitable for processing images and time series data, spatial features can be automatically extracted, and therefore the network structure can be used to construct a rule generation model of the image data type. Further, the process of constructing the image data rule generating model comprises,
1. Acquiring historical data and determining a model structure according to the data type;
The method comprises obtaining historical data, dividing the data into a training set and a validation set according to a time window or data volume, and determining the model structure according to the data type; a CNN model structure is adopted here. A CNN model generally comprises basic structural units such as convolutional layers, pooling layers and fully connected layers. The model built in this embodiment comprises a convolutional layer, a pooling layer and a fully connected layer; the input of the model is the reshaped image data, and the output is a probability score for each category. In a classification task, the category with the highest predicted probability in the model output is taken as the prediction result.
2. The input features of the model are the original image data, and the labels are the corresponding category labels.
The method comprises adjusting the shape of the original image. For a CNN model, the input data is usually a four-dimensional array whose shape is expressed as (batch_size, height, width, channels), where batch_size is the number of samples fed into the model in one training step, height is the height of the image, width is the width of the image, and channels is the number of channels of the image, 1 for a grey-scale image and 3 for a colour (RGB) image. In this embodiment, to meet the requirements of the CNN model, the original image data is reshaped to fit the CNN input, i.e. (-1, 28, 28, 1), where -1 means that the size of that dimension is computed automatically so that the total number of elements remains consistent, the first 28 is the image height in pixels, the second 28 is the image width in pixels, and 1 is the number of channels, i.e. a grey-scale image.
Label one-hot encoding: category labels are converted into one-hot encoding so that the model can distinguish between different categories. In this embodiment, each label is converted into a binary vector of length 10, in which only one element is 1 and the remaining elements are 0, representing the different categories.
3. The model is trained using the training set data, optimized with a cross-entropy loss function (cross-entropy loss) and an optimization algorithm (e.g. Adam or SGD). The cross-entropy loss function is a commonly used loss function for classification problems that measures the gap between the predicted and true values, and the optimization algorithm updates the parameters of the model to minimize the loss function. For binary classification problems a sigmoid-based binary cross-entropy is commonly used, while for multi-class problems the cross-entropy loss is combined with a softmax output to measure the difference between the predicted and true class probability distributions.
4. Hyperparameters (e.g. the learning rate and batch size) are adjusted on the validation set to avoid overfitting. The validation set is a data set separate from the training set, used to evaluate the performance of the model during training and to adjust the hyperparameters so as to optimize the model.
Through the steps, a rule generation model capable of extracting features from data and classifying or identifying the features can be constructed, so that a data conversion rule is generated.
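A hedged Keras sketch matching the structure described above: grey-scale 28x28 inputs reshaped to (-1, 28, 28, 1), one-hot labels of length 10, a convolutional layer, a pooling layer and fully connected layers trained with a cross-entropy loss. MNIST is used only as stand-in image data; the layer sizes are assumptions.

```python
# Hedged CNN sketch: grey-scale 28x28 images reshaped to (-1, 28, 28, 1),
# one-hot labels of length 10, categorical cross-entropy loss, Adam optimizer.
import numpy as np
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()   # stand-in image data
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)          # one-hot encoding

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),           # per-class probability scores
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train[:2000], y_train[:2000], epochs=1, batch_size=64, verbose=0)
```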
Furthermore, this step can also include storing the generated rules to build a rule base; the rule base supports subsequent reuse and optimization and provides functions such as dynamic updating, version control and rule querying.
The structural design of rule base typically includes rule ID, rule content, creation time, version information, and frequency of use. The rule ID is an ID uniquely identifying the rule, the rule content is a specific rule description including conditions and results, the creation time is the time of rule generation, the version information is the version number of the rule for version management, and the frequency of use is the number of times the rule is called to evaluate the validity of the rule.
When constructing the rule base, a relational database (e.g. MySQL, PostgreSQL) or a non-relational database (e.g. MongoDB) may be selected to store the rule data. Caching tools such as Redis are used to improve rule query efficiency, and a RESTful API interface is built with Flask or FastAPI so that other systems can conveniently call the rule base.
In this embodiment, the Flask framework and an SQLite database are used to construct the rule base, which specifically includes defining the database model of the rule base and defining rule creation and query interfaces, so as to implement creation, query and deletion of rules.
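A minimal Flask + SQLite sketch of such a rule base; the route names and schema fields (rule content, creation time, version, usage count) follow the structure described above but are otherwise assumptions.

```python
# Minimal Flask + SQLite rule-base sketch (schema and routes are illustrative).
import sqlite3
from datetime import datetime
from flask import Flask, jsonify, request

app = Flask(__name__)
DB = "rules.db"

def init_db():
    with sqlite3.connect(DB) as con:
        con.execute("""CREATE TABLE IF NOT EXISTS rules (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            content TEXT NOT NULL,
            created_at TEXT,
            version INTEGER DEFAULT 1,
            use_count INTEGER DEFAULT 0)""")

@app.route("/rules", methods=["POST"])
def create_rule():
    body = request.get_json()
    with sqlite3.connect(DB) as con:
        cur = con.execute("INSERT INTO rules (content, created_at) VALUES (?, ?)",
                          (body["content"], datetime.utcnow().isoformat()))
    return jsonify({"id": cur.lastrowid}), 201

@app.route("/rules/<int:rule_id>", methods=["GET"])
def get_rule(rule_id):
    with sqlite3.connect(DB) as con:
        row = con.execute("SELECT id, content, version FROM rules WHERE id = ?",
                          (rule_id,)).fetchone()
    return (jsonify({"id": row[0], "content": row[1], "version": row[2]})
            if row else ("not found", 404))

if __name__ == "__main__":
    init_db()
    app.run()
```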
S4, acquiring a real-time data stream, optimizing the rule based on a trained rule optimization model, and adaptively adjusting the rule according to real-time monitoring data,
S401, acquiring real-time data information to monitor a data stream and acquiring changed data characteristics;
A real-time data stream is received using Kafka and processed and analysed based on Apache Flink or Spark Streaming to obtain the data stream characteristics; the feature distribution is analysed periodically to identify potential pattern changes, and the monitored data characteristics provide data support for the priority or scope of application of the subsequently adjusted rules.
In this embodiment, processing the real-time data stream based on Apache Flink includes defining the extraction logic for data features; a KeyedProcessFunction can be used to monitor changes in the data features. KeyedProcessFunction provides processing-time and event-time semantics and can be used to implement complex rule adjustment logic.
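The following simplified Python sketch substitutes a kafka-python consumer with rolling window statistics for the Flink KeyedProcessFunction described above (an implementation simplification, not the embodiment's stack); the topic name, feature name and baseline statistics are placeholders.

```python
# Simplified monitoring sketch: kafka-python stands in for the Flink
# KeyedProcessFunction; rolling statistics per feature are compared against a
# baseline to flag potential pattern drift in the stream.
import json
from collections import deque
from statistics import mean, pstdev
from kafka import KafkaConsumer

consumer = KafkaConsumer("raw-data", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))

window = deque(maxlen=500)               # sliding window of recent feature values
baseline_mean, baseline_std = 0.0, 1.0   # assumed reference statistics

for message in consumer:
    value = message.value.get("feature_x")
    if value is None:
        continue
    window.append(value)
    if len(window) == window.maxlen:
        m, s = mean(window), pstdev(window)
        # Flag a potential pattern change when the window statistics drift
        # noticeably from the baseline; rule adjustment consumes this signal.
        if abs(m - baseline_mean) > 3 * baseline_std:
            print(f"feature_x drift detected: window mean={m:.3f}, std={s:.3f}")
```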
S402, constructing a rule optimization model through a reinforcement learning algorithm according to the acquired real-time data, optimizing the rule according to the historical data and the real-time data stream, and ensuring the validity of the rule.
Reinforcement learning is a method of learning optimal decisions through trial and error, adapting to the environment by means of rewards or penalties. Reinforcement learning algorithms include Q-learning, Deep Q Networks (DQN) and policy gradient methods (such as REINFORCE): Q-learning is a value-based reinforcement learning algorithm suitable for discrete state and action spaces; a Deep Q Network (DQN) combines Q-learning with deep learning and is suitable for handling complex state spaces; and policy gradient methods (such as REINFORCE) optimize the policy directly and are suitable for continuous action spaces.
In specific implementations, this can be based on environment frameworks such as OpenAI Gym or RLlib (Ray): OpenAI Gym provides a standardized interface for reinforcement learning environments, and RLlib (Ray) is a library for distributed reinforcement learning that supports multiple algorithms.
In this embodiment, automatic rule optimization is implemented based on the Q-learning algorithm; the specific steps include,
Defining a state space, namely defining the current state of the rule, wherein the current state comprises data characteristics, processing effects and the like.
Defining an action space, defining operations that can be performed, such as adjusting parameters, priorities, or adding new rules.
And defining a reward function, namely giving rewards according to the improvement degree of the data processing effect after the rule is applied, and encouraging the system to optimize the rule. Ensuring that the algorithm gets a higher prize in the performance boost and thus learns more efficient rules.
Wherein the defined reward function is as follows:
Q(s, a) ← Q(s, a) + α * (r + γ * π - Q(s, a))
Where s is the current state, a is the current selected action, α is the learning rate, r is the current prize, γ is the discount factor, determining the importance of future prizes, and pi is the maximum expected return after the current state is transferred, i.e. the maximum possible action value of the maximum expected return for the next state.
The reward function is used to update the values in the Q table, which maps the various states to be processed (e.g. data characteristics, rule validity, etc.), according to the current state, the action, the reward and the maximum expected return of the next state. In this way the Q-learning algorithm gradually optimizes the Q table; from the information in the Q table an optimal strategy can be extracted, and the rules are automatically adjusted and optimized to achieve better performance or adaptability, so that the selected actions obtain a higher expected return and the whole decision process is optimized.
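A hedged Python sketch of the Q-learning update written above, where the pi term is taken as the maximum action value of the next state; the states, actions and reward values are illustrative placeholders for rule-adjustment decisions.

```python
# Hedged Q-learning sketch implementing the update written above:
# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.2
actions = ["raise_priority", "lower_priority", "tune_threshold"]
Q = defaultdict(float)  # maps (state, action) -> expected return

def choose_action(state):
    if random.random() < epsilon:                        # exploration
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])     # exploitation

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in actions)    # the "pi" term above
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# One illustrative step: the reward is the improvement in a data-quality metric
# after the adjusted rule is applied (a stand-in value here).
state, next_state = "high_missing_rate", "normal"
a = choose_action(state)
update(state, a, reward=0.8, next_state=next_state)
```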
S403, based on the real-time information flow obtained in step S401, the rules are adaptively adjusted by defining a rule adjustment function. Specifically,
S4031, monitoring the state and performance of the system through Prometheus, and visualizing data through Grafana to display rule effects and system states in real time so as to help identify places needing adjustment.
S4032, adaptively adjusting the rules through a custom rule-adjustment function according to the monitored feature changes of the data, comprising,
Whether an adjustment rule is needed can be determined by monitoring feature changes of the input data in real time. For example, if the value of a feature suddenly changes significantly, it may indicate that the importance of the feature to the classification task has changed, and then the corresponding rule may need to be adjusted. Adjusting the corresponding rules includes dynamically adjusting their priorities based on the effect of the rules, e.g., if a rule performs well in practice and is critical to the current task, its priority may be increased, whereas if a rule performs poorly in practice or does not match the current task, its priority may be decreased.
It should be noted that, the above process of dynamically adjusting the priority of the rule or the adjusting logic of other rules according to the feature change logic may be implemented by defining a rule adjusting function, which updates the weight or other parameters of the rule according to the feature change.
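A minimal sketch of such a custom rule-adjustment function, assuming rules carry a numeric priority (weight) and that per-feature statistics are available from the monitoring step; all field names and thresholds are illustrative.

```python
# Hedged sketch of a rule-adjustment function: when the monitored statistics of
# a feature drift beyond a tolerance, the priority (weight) of rules bound to
# that feature is raised; stable features slowly decay their rules' priority.
def adjust_rules(rules, feature_stats, baseline_stats, tolerance=0.2):
    """rules: list of dicts like {"id": 1, "feature": "amount", "priority": 0.5}."""
    for rule in rules:
        feat = rule["feature"]
        drift = abs(feature_stats[feat] - baseline_stats[feat]) / (abs(baseline_stats[feat]) + 1e-9)
        if drift > tolerance:
            # Feature behaviour changed noticeably: re-evaluate this rule earlier.
            rule["priority"] = min(1.0, rule["priority"] + 0.1)
        else:
            rule["priority"] = max(0.0, rule["priority"] - 0.01)
    return sorted(rules, key=lambda r: r["priority"], reverse=True)

rules = [{"id": 1, "feature": "amount", "priority": 0.5},
         {"id": 2, "feature": "region", "priority": 0.7}]
print(adjust_rules(rules, {"amount": 180.0, "region": 3.0}, {"amount": 100.0, "region": 3.1}))
```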
In step S5, the data is processed by the adjusted rule, and the abnormality detection and correction are performed on the data processed by the rule.
S501, processing the data through the adjusted rules, obtaining the rule-processed data, constructing an anomaly detection model and carrying out anomaly detection on the data, comprising,
A streaming data processing framework is constructed, the rule-processed data are acquired based on Apache Kafka, and outlier detection is carried out using basic statistical characteristics of the data (such as mean, variance and distribution characteristics); common methods include the Z-score and box plots. Alternatively, an anomaly detection model is constructed based on a machine learning algorithm, including clustering algorithms, the Isolation Forest and the Support Vector Machine (SVM). Clustering algorithms identify abnormal points far from other data points by dividing the data into different clusters; the Isolation Forest is an algorithm based on random trees that can effectively identify abnormal data in low-density areas; and the SVM separates normal data from abnormal data by finding an optimal hyperplane. It should be noted that constructing an anomaly detection model for anomaly detection of data belongs to the prior art and is not described in detail here.
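A hedged sketch combining the two detection routes mentioned above, a Z-score check and an Isolation Forest, assuming scikit-learn and synthetic one-dimensional data with two injected outliers.

```python
# Hedged anomaly-detection sketch: statistical Z-score check plus Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), [8.0, -9.5]])   # two injected outliers

# Statistical check: flag points more than 3 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std()
stat_outliers = np.where(np.abs(z_scores) > 3)[0]

# Model-based check: Isolation Forest isolates low-density points quickly.
iso = IsolationForest(contamination=0.01, random_state=0).fit(data.reshape(-1, 1))
model_outliers = np.where(iso.predict(data.reshape(-1, 1)) == -1)[0]

print("z-score outliers:", stat_outliers, "isolation-forest outliers:", model_outliers)
```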
S502, correcting the detected abnormal data, wherein common corrections include data filling, data transformation, outlier deletion and the like (a combined sketch follows the items below). Specifically,
Data filling includes mean/median filling and predictive filling: mean/median filling is used to fill missing values and is suitable for numerical data, while predictive filling uses a regression model or a K-nearest-neighbour algorithm to predict the missing values, which better preserves the original distribution of the data.
Data transformation includes standardization and normalization: standardization converts the data into a distribution with a mean of 0 and a standard deviation of 1 and is suitable for algorithms that assume a normal distribution, while normalization scales the data to a specific interval (such as [0, 1]) and is suitable for situations where data of different magnitudes need to be compared.
Outlier deletion is used for extreme outliers, which may be deleted directly to avoid negatively impacting model training and analysis results.
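A combined sketch of the three corrections above (filling, transformation, outlier deletion), assuming pandas and scikit-learn; the column names and thresholds are illustrative.

```python
# Hedged correction sketch: median and KNN filling, IQR-based outlier deletion,
# then standardisation and [0, 1] normalisation of the corrected data.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({"amount": [10.0, 12.0, np.nan, 11.0, 300.0],
                   "count": [1, 2, 2, np.nan, 3]})

# Filling: median for one numeric column, KNN-based prediction for the other.
df["amount"] = SimpleImputer(strategy="median").fit_transform(df[["amount"]]).ravel()
df["count"] = KNNImputer(n_neighbors=2).fit_transform(df[["count"]]).ravel()

# Deletion: drop extreme outliers (IQR rule) before scaling so they do not distort it.
q1, q3 = df["amount"].quantile([0.25, 0.75])
df = df[df["amount"] <= q3 + 1.5 * (q3 - q1)]

# Transformation: zero-mean/unit-variance standardisation and [0, 1] normalisation.
standardised = StandardScaler().fit_transform(df)
normalised = MinMaxScaler().fit_transform(df)
print(standardised.round(2), normalised.round(2), sep="\n")
```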
S503, cleaning and enhancing the data after abnormality detection and correction;
Further, in tasks such as image classification and text generation, and particularly in Natural Language Processing (NLP) projects, data quality can be improved through text data cleaning and data enhancement with generative adversarial networks (GAN).
Text data cleaning includes denoising, normalization, stop-word removal and word vectorization. Denoising removes special characters, HTML tags and redundant spaces from the text using regular expressions or NLP tools (such as NLTK and spaCy); normalization ensures consistency by unifying the format of the text (such as case conversion and lemmatization); stop-word removal filters out meaningless words (such as "the", "is", etc.) using a stop-word list; and word vectorization converts the text into vector form using techniques such as Word2Vec, GloVe or BERT to facilitate subsequent processing.
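A hedged sketch of this cleaning pipeline using regular expressions and NLTK stop words, with TF-IDF standing in for Word2Vec/GloVe/BERT embeddings; the sample documents are placeholders.

```python
# Hedged text-cleaning sketch: regex denoising, normalisation, stop-word removal,
# and a TF-IDF vectorisation stand-in for dense word embeddings.
import re
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)          # strip punctuation/special symbols
    text = re.sub(r"\s+", " ", text).strip().lower()     # collapse whitespace, normalise case
    stops = set(stopwords.words("english"))
    return " ".join(w for w in text.split() if w not in stops)

docs = ["<p>The amount field is   NOT valid!!</p>", "Duplicate customer rows found."]
cleaned = [clean_text(d) for d in docs]
vectors = TfidfVectorizer().fit_transform(cleaned)       # TF-IDF instead of dense embeddings
print(cleaned, vectors.shape)
```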
Data enhancement is based on a generative adversarial network (GAN), a deep learning model consisting of two core parts: a Generator, which synthesizes data, and a Discriminator, which evaluates the authenticity of the synthesized data; the two are trained adversarially against each other. The GAN can generate data according to specific conditions (e.g. category labels) to ensure the diversity and pertinence of the generated data, and different data enhancement strategies may be employed for different data types, specifically,
Image data: the diversity of the image data is increased through techniques such as rotation, translation, scaling and noise addition (see the augmentation sketch after this list);
Text data: new text instances are generated by methods such as synonym replacement, random insertion/deletion of words and back-translation (translating the text into another language and back into the original language);
Time series data: new time series data are generated by applying methods such as time series decomposition and perturbation.
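For the image branch above, a hedged augmentation sketch using Keras' ImageDataGenerator for rotation, translation and scaling, with Gaussian noise added separately; the tool choice and parameters are assumptions, as the embodiment could equally use a GAN-based generator.

```python
# Hedged image-augmentation sketch: rotation, translation, scaling and noise.
import numpy as np
import tensorflow as tf

(x_train, _), _ = tf.keras.datasets.mnist.load_data()          # stand-in image data
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0

augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=15,        # random rotation in degrees
    width_shift_range=0.1,    # horizontal translation
    height_shift_range=0.1,   # vertical translation
    zoom_range=0.1,           # random scaling
)

batch = next(augmenter.flow(x_train[:32], batch_size=32, shuffle=False))
noisy = np.clip(batch + np.random.normal(0.0, 0.05, batch.shape), 0.0, 1.0)  # Gaussian noise
print("augmented batch:", noisy.shape)
```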
The machine learning-based data conversion method provided by the embodiment generates an automatic rule through real-time data pattern recognition, and performs rule optimization and comprehensive data detection and correction in a self-adaptive manner, so that the efficiency and accuracy of data processing are remarkably improved.
Example two
Referring to fig. 2, the present embodiment provides a machine learning-based data conversion apparatus for implementing the machine learning-based data conversion method provided in the first embodiment, including,
The original data acquisition module is used for acquiring original data and preprocessing the original data;
the feature selection module is used for carrying out feature selection on the data through a feature selection algorithm, and extracting the most relevant and most informative feature subset from the original data;
The rule generation module is used for acquiring the data after feature selection and generating a corresponding data conversion rule;
The rule optimization module is used for acquiring a real-time data stream, optimizing the rule based on a trained rule optimization model, and/or adaptively adjusting the rule by defining a rule adjustment function according to real-time monitoring data;
the data monitoring and correcting module is used for acquiring the data after the rule processing and carrying out abnormality detection and correction on the data after the rule processing.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above functional units and the division of the modules are illustrated, and in practical application, the above functions may be allocated to different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to complete all or part of the functions described above.
Example III
The present embodiment provides a non-transitory computer-readable storage medium having at least one instruction or at least one program stored therein, the at least one instruction or the at least one program loaded by a processor and executed the machine learning-based data conversion method in embodiment one.
Example IV
The present embodiment provides an electronic device including a processor and a non-transitory computer-readable storage medium in embodiment three of the present invention.
The above description of the embodiments is only for aiding in the understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and these modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It should be noted that some exemplary embodiments are described as a process or a method depicted as a flowchart. Although a flowchart depicts steps as a sequential process, many of the steps may be implemented in parallel, concurrently, or with other steps. Furthermore, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figures. The processes may correspond to methods, functions, procedures, subroutines, and the like.

Claims (10)

1. A data conversion method based on machine learning is characterized by comprising the following steps of,
Acquiring original data and preprocessing the original data;
Performing feature selection on the preprocessed data based on a feature selection algorithm;
Inputting the data after feature selection into a corresponding trained rule generation model based on the data type to generate a corresponding data conversion rule, wherein the rule generation model is respectively constructed and trained according to the data type and comprises an image data rule generation model and a text data rule generation model;
Acquiring a real-time data stream, optimizing rules based on a trained rule optimization model, and/or adaptively adjusting the rules by defining a rule adjustment function according to real-time monitoring data;
and processing the data through the adjusted rules, and performing anomaly detection and correction on the rule-processed data.
2. The machine learning based data conversion method of claim 1, wherein the preprocessing of the data includes data source identification, data metadata extraction, data cleansing and/or data format conversion.
3. The machine learning based data conversion method of claim 1, wherein the image data rule generation model is constructed based on a convolutional neural network model and the text data rule generation model is constructed based on a recurrent neural network model.
4. The machine learning based data transformation method of claim 3, wherein the image data rule generation model is constructed based on a convolutional neural network model comprising,
Acquiring image history data, and dividing the data into a training set and a verification set;
determining a convolutional neural network CNN model structure, which comprises a convolutional layer, a pooling layer and a full-connection layer;
Determining that the label of the data is a category label, and performing single-heat coding conversion on the category label;
Training the model by using training set data, and optimizing the model by combining the loss function so as to update parameters of the model and minimize the loss function;
the performance of the model is evaluated by the validation set and the hyperparameters are adjusted to optimize the model, the adjusted hyperparameters including the learning rate and/or batch size.
5. The machine learning based data transformation method of claim 3, wherein the text data rule generating model is constructed based on a recurrent neural network model comprising,
Acquiring text history data, and dividing the data into a training set and a verification set;
determining a cyclic neural network RNN model structure, wherein the cyclic neural network RNN model structure comprises a basic cyclic neural network RNN, a long-short-term memory network LSTM or a gating cyclic unit GRU;
defining the input and output of the model, wherein the input is a text sequence, and the output is the prediction of the next word or the generation of the whole sentence;
Training the model by using training set data, measuring the difference between the model prediction and the true value by combining a loss function, and updating model parameters by an optimization algorithm;
and evaluating the performance of the model by using the validation set, and adjusting the hyperparameters of the model according to the evaluation result, wherein the hyperparameters comprise the learning rate, the hidden layer size and/or the batch size.
6. The machine learning based data transformation method of claim 1, wherein the rule optimization model construction process comprises,
Defining a state space, namely defining the current state of the rule, wherein the current state comprises data characteristics and/or processing effects;
Defining action space, namely defining executable operations including adjusting parameters, priorities or adding new rules;
Defining a reward function, namely giving rewards according to the improvement degree of the data processing effect after rule application, encouraging a system to optimize the rule, ensuring that the algorithm obtains higher rewards when the performance is improved, and learning more effective rules;
Wherein, the defined reward function is:
Q(s, a) ← Q(s, a) + α * (r + γ * π - Q(s, a))
Where s is the current state, a is the current selected action, α is the learning rate, r is the current prize, γ is the discount factor, determining the importance of future prizes, and pi is the maximum expected return after the current state is transferred, i.e. the maximum possible action value of the maximum expected return for the next state.
7. The machine-learning based data conversion method of claim 1, wherein adaptively adjusting the rules based on the real-time monitoring data comprises,
Monitoring the characteristic change of the data in real time, recording the characteristic distribution change and performing visual display;
receiving a real-time data stream, and providing real-time data input for rule adjustment;
processing and analyzing the real-time data stream, and checking the data integrity to obtain the data stream characteristics;
and according to the monitored characteristic change of the data, the self-defining rule adjusting function carries out self-adapting adjustment on the rule.
8. A machine learning based data conversion apparatus for implementing the machine learning based data conversion method as claimed in claim 1, comprising,
The original data acquisition module is used for acquiring original data and preprocessing the original data;
the feature selection module is used for carrying out feature selection on the data through a feature selection algorithm and extracting a feature subset from the original data;
the rule generation module is used for acquiring the data after feature selection and generating a corresponding data conversion rule;
The rule optimization module is used for defining a reward function based on a reinforcement learning algorithm to optimize the rule and adaptively adjusting the rule according to real-time monitoring data;
the data monitoring and correcting module is used for acquiring the data after the rule processing and carrying out abnormality detection and correction on the data after the rule processing.
9. A non-transitory computer readable storage medium having at least one instruction or at least one program stored therein, wherein the at least one instruction or the at least one program is loaded and executed by a processor to implement the machine learning based data conversion method of any one of claims 1-7.
10. An electronic device comprising a processor and the non-transitory computer readable storage medium of claim 9.
CN202510080610.5A 2025-01-20 2025-01-20 A data conversion method, device, medium and equipment based on machine learning Pending CN119513503A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510080610.5A CN119513503A (en) 2025-01-20 2025-01-20 A data conversion method, device, medium and equipment based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510080610.5A CN119513503A (en) 2025-01-20 2025-01-20 A data conversion method, device, medium and equipment based on machine learning

Publications (1)

Publication Number Publication Date
CN119513503A true CN119513503A (en) 2025-02-25

Family

ID=94653857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510080610.5A Pending CN119513503A (en) 2025-01-20 2025-01-20 A data conversion method, device, medium and equipment based on machine learning

Country Status (1)

Country Link
CN (1) CN119513503A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119781535A (en) * 2025-03-11 2025-04-08 浙江万能弹簧机械有限公司 A wire cutting machine feed speed control system and method
CN120263558A (en) * 2025-06-05 2025-07-04 江苏君立华域信息安全技术股份有限公司 An adaptive distributed network threat detection and response system and method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114564586A (en) * 2022-03-04 2022-05-31 中信银行股份有限公司 A method and system for identifying unstructured sensitive data
CN117610661A (en) * 2023-12-14 2024-02-27 北京科东电力控制系统有限责任公司 Power dispatching operation and maintenance operation ticket conversion rule generation method and system
CN117709446A (en) * 2023-12-14 2024-03-15 中证鹏元资信评估股份有限公司 Method for constructing dynamic financial credit risk model based on rule engine
CN118536609A (en) * 2024-06-05 2024-08-23 武汉中地大智慧城市研究院有限公司 Intelligent rule engine configuration method and system

Similar Documents

Publication Publication Date Title
US10025813B1 (en) Distributed data transformation system
US10360517B2 (en) Distributed hyperparameter tuning system for machine learning
CN117669984B (en) Workshop scheduling method based on reinforcement learning of digital twin and knowledge graph
Zhang et al. Deep learning-driven data curation and model interpretation for smart manufacturing
CN119067689B (en) Production tracing and transaction collaboration method and system of MES system
CN119513503A (en) A data conversion method, device, medium and equipment based on machine learning
US20200285984A1 (en) System and method for generating a predictive model
CN119441993A (en) Transformer life prediction method and system based on machine learning and Internet of Things
CN117495109B (en) A neural network-based electricity stealing user identification system
Yan et al. Big-data-driven based intelligent prognostics scheme in industry 4.0 environment
CN119597834B (en) Unstructured data automatic processing method and system based on deep learning
CN117235444A (en) Financial wind control method and system integrating deep learning and expert experience
CN118194487A (en) Automatic arrangement method, medium and system for circuit and electric equipment
CN119580022A (en) Method and system for analyzing defects in wafer manufacturing based on big data
CN113268370A (en) Root cause alarm analysis method, system, equipment and storage medium
CN112949825B (en) Resource adjustment method, device and equipment
CN119005087B (en) Automatic optimization method and system for PCB board splitting path based on machine learning
CN119202545A (en) Monitoring method and system for data governance process
CN119599946A (en) A road crack detection method, device and electronic equipment
CN119107047A (en) Agent-based decision-making support system and method for water transport engineering project management
EP4339845A1 (en) Method, apparatus and electronic device for detecting data anomalies, and readable storage medium
US20240289609A1 (en) System for training neural network to detect anomalies in event data
CN116757338A (en) Crop yield prediction method, device, electronic equipment and storage medium
CN119579602B (en) Defect detection system, method and device based on artificial intelligent image processing
CN119295984B (en) An adaptive inspection method and system for unmanned aerial vehicles in the field of industrial three-defense

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination