CN116955444B - Method and system for mining collected noise points based on big data analysis

Info

Publication number: CN116955444B
Application number: CN202310717597.0A
Authority: CN
Inventors: 凌晓华; 王晓宇; 章熠辉
Original assignee: Individual
Current assignee: Liu Fu
Priority date: 2023-06-15
Filing date: 2023-06-15
Publication date: 2024-08-23
Anticipated expiration: 2043-06-15
Also published as: CN116955444A

Abstract

The invention relates to the technical field of data noise point mining, in particular to a method and a system for mining collected noise points based on big data analysis. The method comprises the following steps: carrying out data standardization processing on the cleaning big data to obtain standardized big data; performing feature extraction on the standardized big data to obtain big data features, calculating matrix variance values corresponding to each matrix in the feature matrix, determining an abnormal matrix in the feature matrix, constructing a data scatter diagram corresponding to the standardized big data, calculating distance values between each data point in the data scatter diagram, obtaining a total distance value of the data points, and determining discrete data points in the data scatter diagram; and determining noise data in the standardized big data, extracting a data signal corresponding to the noise data, calculating a signal sparse value corresponding to the data signal, and executing noise optimization processing of the data noise point to obtain a noise optimization processing result. The invention aims to improve the accuracy of mining the collected noise points based on big data analysis.

Description

Method and system for mining collected noise points based on big data analysis

Technical Field

The invention relates to the technical field of data noise point mining, in particular to a method and a system for mining collected noise points based on big data analysis.

Background

Big data analysis refers to the methods, tools and applications used to collect, process and derive insight from large, varied, high-speed data sets. The data may come from many sources, such as the Web, mobile applications, email, social media and networked smart devices, and it is typically generated at high speed and in a variety of forms, from structured (database tables, Excel tables) to semi-structured (XML files, web pages) to unstructured (images, audio files). Ideally the collected data would be complete, but device malfunctions, human operating errors and other noise disturbances during collection interfere with the data and create noise data, which in turn affects subsequent data analysis. Noise points therefore need to be mined from the data in order to improve the accuracy of subsequent data analysis.

However, the existing method for mining collected noise points based on big data analysis mainly extracts redundant data from the big data, marks the redundant data nodes corresponding to the redundant data, identifies the redundant node fields of the redundant data nodes, and obtains the noise points of the big data according to the redundant node fields. This method processes only the redundant data, whereas noise data in big data varies in both cause and type, so using it reduces the efficiency of mining noise points in big data. A method capable of improving the efficiency of mining collected noise points based on big data analysis is therefore required.

Disclosure of Invention

The invention provides a method and a system for mining collected noise points based on big data analysis to solve at least one technical problem.

A method for mining collected noise points based on big data analysis comprises the following steps:

Step S1: acquiring original big data to be mined and a data field, performing data cleaning on the original big data to obtain cleaning big data, and performing data standardization processing on the cleaning big data according to the data field to obtain standardized big data;

Step S2: performing feature extraction on the standardized big data to obtain big data features, constructing feature matrixes corresponding to the big data features, calculating matrix variance values corresponding to each matrix in the feature matrixes, and determining abnormal matrixes in the feature matrixes according to the matrix variance values, wherein the calculating the matrix variance values corresponding to each matrix in the feature matrixes comprises the following steps:

Calculating a matrix variance value corresponding to each matrix in the feature matrix through the following formula:

Wherein E represents the matrix variance value corresponding to each matrix in the feature matrix, a represents the matrix serial number of the feature matrix, y represents the total number of feature matrices, G_a represents the matrix expected value of the a-th feature matrix, G_a represents the matrix value corresponding to the a-th feature matrix, and the remaining term represents the average value of the feature matrix;

Step S3: performing linear conversion on the standardized big data to obtain a big data linear value, constructing a data scatter diagram corresponding to the standardized big data according to the big data linear value, calculating a distance value between each data point in the data scatter diagram to obtain a total data point distance value, and determining discrete data points in the data scatter diagram according to the total data point distance value;

Step S4: combining the abnormal matrix and the discrete data points to determine noise data in the standardized big data, extracting a data signal corresponding to the noise data, calculating a signal sparse value corresponding to the data signal, calculating a signal-to-noise ratio corresponding to the data signal, inquiring a data noise point in the noise data according to the signal-to-noise ratio and the signal sparse value, extracting a noise characteristic corresponding to the data noise point, constructing a noise optimization scheme corresponding to the data noise point according to the noise characteristic, and executing noise optimization processing of the data noise point according to the noise optimization scheme to obtain a noise optimization processing result.

In an embodiment of the present disclosure, the performing, according to the data field, data normalization processing on the cleaning big data to obtain normalized big data includes:

Scheduling historical big data in the data field, analyzing the data architecture of each data item in the historical big data, and determining the data format of each data item in the historical big data according to the data architecture;

Measuring the format frequency of each format in the data formats, and determining the standard format in the historical big data according to the format frequency;

Inquiring a format source code corresponding to the standard format, and formulating a format converter for cleaning big data according to the format source code;

And carrying out format standardization processing on the cleaning big data by using the format converter to obtain standardized big data.

In one embodiment of the present disclosure, the constructing a feature matrix corresponding to the big data feature includes:

performing dimension reduction processing on the big data features to obtain dimension reduction features, and performing vector conversion on the dimension reduction features to obtain feature vectors;

Calculating a feature vector value corresponding to the feature vector, and taking the feature vector value as a feature value corresponding to the big data feature;

And calculating vector similarity coefficients among the feature vectors, and constructing a feature matrix corresponding to the big data features according to the vector similarity coefficients and the feature values.

In one embodiment of the present specification, the calculating the vector similarity coefficient between the feature vectors includes:

vector similarity coefficients between the feature vectors are calculated by the following formula:

Wherein D represents the vector similarity coefficient between the feature vectors, b represents the sequence number of a feature vector, B represents the number of feature vectors, A_b represents the vector length corresponding to the b-th feature vector, the remaining term represents the average value of the lengths of all the feature vectors, and A_{b+1} represents the vector length corresponding to the (b+1)-th feature vector.

In one embodiment of the present disclosure, the constructing a data scatter diagram corresponding to the normalized big data according to the big data linear value includes:

acquiring a data sequence of each datum in the standardized big data, and extracting variable data of each datum in the standardized big data, wherein the variable data comprises self-variable data and dependent variable data;

analyzing the variable relation between the self-variable data and the dependent variable data, and calculating variable values corresponding to the self-variable data and the dependent variable data according to the variable relation and the big data linear value to obtain a first variable value and a second variable value;

and constructing a data scatter diagram corresponding to the standardized big data according to the first variable value, the second variable value and the data sequence.

In one embodiment of the present disclosure, the calculating a distance value between each data point in the data scatter plot, to obtain a total distance value between the data points, includes:

calculating a total value of distances between each data point in the data scatter plot by the following formula:

Wherein H represents the total value of the distances between the data points in the data scatter plot, i represents the serial number of a data point in the data scatter plot, q represents the number of data points in the data scatter plot, K_{i-1} represents the coordinate value of the (i-1)-th data point in the data scatter plot, K_i represents the coordinate value of the i-th data point in the data scatter plot, and K_q represents the coordinate value of the q-th data point in the data scatter plot.

In one embodiment of the present disclosure, the calculating a signal sparseness value corresponding to the data signal includes:

identifying a data time domain signal and a data frequency domain signal in the data signals, and carrying out Fourier transform on the data time domain signals to obtain transformed data signals;

According to the transformed data signal and the data frequency domain signal, carrying out signal reconstruction on the data signal to obtain a target data signal;

And calculating signal information entropy corresponding to the target data signal, and taking the signal information entropy as a signal sparse value corresponding to the data signal.

In one embodiment of the present disclosure, the calculating the signal information entropy corresponding to the target data signal includes:

Calculating signal information entropy corresponding to the target data signal according to the following formula:

Wherein P represents the signal information entropy corresponding to the target data signal, j represents the sequence number of the time period corresponding to the target data signal, t represents the number of packets in the target data signal, N_j represents the data signal value of the j-th time period in the target data signal, and M(N_j) represents the probability of occurrence of the data signal value of the j-th time period in the target data signal.
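The formula itself is not reproduced in this text. The symbol definitions are consistent with the usual Shannon form, P = -Σ M(N_j)·log M(N_j), and the following minimal Python sketch is written under that assumption. The base-2 logarithm, the estimation of M(N_j) by relative frequency, and the function name signal_information_entropy are illustrative choices, not taken from the patent.

```python
import math
from collections import Counter


def signal_information_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of signal values.

    Assumption: M(N_j) is estimated as the relative frequency of each value.
    """
    if not values:
        raise ValueError("signal must contain at least one value")
    counts = Counter(values)
    total = len(values)
    entropy = 0.0
    for count in counts.values():
        probability = count / total  # plays the role of M(N_j)
        entropy -= probability * math.log2(probability)
    return entropy


# Hypothetical signals: one constant, one with four equally likely values.
constant_signal = [3, 3, 3, 3]
mixed_signal = [0, 1, 2, 3]
print(signal_information_entropy(constant_signal))
print(signal_information_entropy(mixed_signal))
```

Under this reading, a signal whose values are concentrated on a few outcomes yields a low entropy, and a signal spread evenly over many outcomes yields a high one.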

In an embodiment of the present disclosure, the constructing, according to the noise characteristic, a noise optimization scheme corresponding to the data noise point includes:

Inquiring a characteristic index corresponding to the noise characteristic, and extracting an index parameter corresponding to the characteristic index;

Calculating the parameter difference between the index parameter and the standard index parameter, and determining an index to be optimized in the characteristic index according to the parameter difference;

Inquiring an optimization rule of each index in the indexes to be optimized, and constructing a noise optimization scheme corresponding to the data noise point according to the optimization rule.
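The three sub-steps above can be sketched as follows. This is only an illustration under stated assumptions: the index names, the tolerance parameter and the rule strings are hypothetical, and the parameter difference is taken to be a simple subtraction, which the text does not specify.

```python
def build_optimization_scheme(index_params, standard_params, rules, tolerance=0.0):
    """Select the indexes whose parameters deviate from the standard and attach a rule.

    Assumptions: the parameter difference is a plain subtraction, and an index
    needs optimizing when the absolute difference exceeds `tolerance`.
    """
    scheme = []
    for name, value in index_params.items():
        difference = value - standard_params[name]
        if abs(difference) > tolerance:
            scheme.append(
                {
                    "index": name,
                    "difference": difference,
                    "rule": rules.get(name, "no rule registered"),
                }
            )
    return scheme


# Hypothetical characteristic indexes of one data noise point.
index_params = {"amplitude": 1.5, "frequency": 50.0}
standard_params = {"amplitude": 1.0, "frequency": 50.0}
rules = {"amplitude": "clip to standard amplitude"}

scheme = build_optimization_scheme(index_params, standard_params, rules, tolerance=0.1)
print(scheme)
```

Here only the amplitude index deviates from its standard value by more than the tolerance, so it is the only index placed in the scheme.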

An acquisition noise point mining system based on big data analysis, the system comprising:

The data processing module is used for acquiring the original big data to be mined and the data field, carrying out data cleaning on the original big data to obtain cleaning big data, and carrying out data standardization processing on the cleaning big data according to the data field to obtain standardized big data;

the matrix variance calculating module is configured to perform feature extraction on the standardized big data to obtain big data features, construct feature matrices corresponding to the big data features, calculate matrix variance values corresponding to each matrix in the feature matrices, and determine abnormal matrices in the feature matrices according to the matrix variance values, where the calculating the matrix variance values corresponding to each matrix in the feature matrices includes:

The discrete point determining module is used for carrying out linear conversion on the standardized big data to obtain a big data linear value, constructing a data scatter diagram corresponding to the standardized big data according to the big data linear value, calculating a distance value between each data point in the data scatter diagram to obtain a total data point distance value, and determining discrete data points in the data scatter diagram according to the total data point distance value;

The noise point optimizing module is used for combining the abnormal matrix and the data discrete points, determining noise data in the standardized big data, extracting data signals corresponding to the noise data, calculating signal sparse values corresponding to the data signals, calculating signal to noise ratios corresponding to the data signals, inquiring data noise points in the noise data according to the signal to noise ratios and the signal sparse values, extracting noise characteristics corresponding to the data noise points, constructing a noise optimizing scheme corresponding to the data noise points according to the noise characteristics, executing noise optimizing processing of the data noise points according to the noise optimizing scheme, and obtaining noise optimizing processing results.

By acquiring the original big data to be mined and the data field and performing data cleaning on the original big data, the invention removes invalid data and repeated data from the original big data and improves its data quality, and the data field indicates the processing techniques applicable to the original big data. By linearly converting the standardized big data, the invention converts the standardized big data into numerical form, so that it can be displayed visually through numerical values and understood more intuitively. By combining the abnormal matrix and the discrete data points to determine the noise data in the standardized big data, the invention extracts the noise data completely, which facilitates the subsequent noise point analysis of the noise data and in turn improves the accuracy of noise point mining. The method and system for mining collected noise points based on big data analysis can therefore improve the accuracy of mining collected noise points based on big data analysis.

Drawings

Other features, objects and advantages of the application will become more apparent upon reading of the detailed description of a non-limiting implementation, made with reference to the accompanying drawings in which:

FIG. 1 is a schematic flow chart of a method for mining collected noise points based on big data analysis according to an embodiment of the present invention;

FIG. 2 is a functional block diagram of an acquisition noise point mining system based on big data analysis according to an embodiment of the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

The following is a clear and complete description of the technical method of the present patent in conjunction with the accompanying drawings, and it is evident that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.

Furthermore, the drawings are merely schematic illustrations of the present invention and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. The functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

It will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

Referring to fig. 1, the method for mining the collected noise points based on big data analysis comprises the following steps:

Step S1: acquiring original big data to be mined and a data field, performing data cleaning on the original big data to obtain cleaning big data, and performing data standardization processing on the cleaning big data according to the data field to obtain standardized big data;

By acquiring the original big data to be mined and the data field and performing data cleaning on the original big data, the invention removes invalid data and repeated data from the original big data and improves its data quality, and the data field indicates the processing techniques applicable to the original big data. The original big data is the required data obtained by scheduling a database or through related equipment; this data, such as image data, video and audio data, and text data, contains noise data. The data field is the type corresponding to the original big data; for example, the data field corresponding to image data is the image field. The cleaning big data is the data obtained after repeated, invalid and abnormal data have been removed from the original big data. Optionally, the data cleaning of the original big data can be implemented with a data cleaning tool written in a scripting language.
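The cleaning described above, removing repeated and invalid data, can be illustrated with a minimal Python sketch. The record layout (a sensor identifier paired with a reading), the validity rule and the function names are hypothetical; the text only states that a script-language cleaning tool may be used.

```python
import math


def clean_records(records, is_valid):
    """Drop invalid records and exact duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for record in records:
        if not is_valid(record):
            continue  # invalid data is discarded
        if record in seen:
            continue  # repeated data is discarded
        seen.add(record)
        cleaned.append(record)
    return cleaned


def has_finite_reading(record):
    """Hypothetical validity rule: non-empty identifier and a finite numeric reading."""
    sensor_id, reading = record
    return bool(sensor_id) and reading is not None and math.isfinite(reading)


# Hypothetical raw records containing a duplicate, a missing value,
# an empty identifier and a NaN reading.
raw = [
    ("s1", 20.5),
    ("s1", 20.5),
    ("s2", None),
    ("", 3.0),
    ("s3", float("nan")),
    ("s2", 19.0),
]
cleaned = clean_records(raw, has_finite_reading)
print(cleaned)
```

Passing the validity rule in as a function keeps the sketch independent of any particular data field, since the text indicates that the applicable processing depends on the field.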

According to the data field, the data standardization processing is carried out on the cleaning big data, and the data format in the cleaning big data can be standardized so as to facilitate the subsequent improvement of the data processing efficiency, wherein the standardized big data is the data obtained after the standardized processing of the format of the cleaning big data.

As an embodiment of the present invention, according to the data field, performing data normalization processing on the cleaning big data to obtain normalized big data, including: scheduling historical big data in the field of data, analyzing the data architecture of each data in the historical big data, determining the data format of each data in the historical big data according to the data architecture, measuring the format frequency of each format in the data format, determining the standard format in the historical big data according to the format frequency, inquiring the format source code corresponding to the standard format, formulating the format converter of the cleaning big data according to the format source code, and carrying out format standardization processing on the cleaning big data by utilizing the format converter to obtain standardized big data.

The data architecture is the structure formed by the data structures corresponding to the historical big data; the data format is the format in which each data item in the historical big data is processed by a computer; the format frequency is the number of occurrences of each format among the data formats; the format source code is the program code corresponding to the standard format; and the format converter is the tool used to perform format standardization processing on the cleaning big data.

Optionally, the scheduling of the historical big data in the data field may be implemented by a data scheduler, the parsing of the data architecture of each data in the historical big data may be implemented by a structural analysis method, the metering of the format frequency of each format in the data format may be implemented by a scientific counting method, the querying of the format source code corresponding to the standard format may be implemented by a code querier, and the formulating of the format converter for cleaning the big data may be implemented by code programming according to the format source code.
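The selection of a standard format by format frequency, followed by conversion, can be sketched as below. This is an illustration only: the format labels, the converter table and the function names are hypothetical, and the standard format is assumed to be the most frequent one, which is one plausible reading of "according to the format frequency".

```python
from collections import Counter


def pick_standard_format(historical_formats):
    """Return the most frequent format label together with the frequency table."""
    if not historical_formats:
        raise ValueError("no historical formats to count")
    frequency = Counter(historical_formats)
    standard_format, _ = frequency.most_common(1)[0]
    return standard_format, frequency


def make_converter(standard_format, converters):
    """Build a function that converts (format, payload) records to the standard format.

    `converters` maps (source_format, target_format) to a payload-level function.
    """
    def convert(record):
        source_format, payload = record
        if source_format == standard_format:
            return record  # already in the standard format
        to_standard = converters[(source_format, standard_format)]
        return (standard_format, to_standard(payload))

    return convert


# Hypothetical formats observed in the historical big data.
historical = ["csv", "json", "csv", "xml", "csv", "json"]
standard, frequency = pick_standard_format(historical)

# Hypothetical payload converter from a JSON-like dict to a CSV row.
converters = {
    ("json", "csv"): lambda d: ",".join(str(d[k]) for k in sorted(d)),
}
convert = make_converter(standard, converters)
converted = convert(("json", {"a": 1, "b": 2}))
print(standard, converted)
```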

Step S2: performing feature extraction on the standardized big data to obtain big data features, constructing feature matrixes corresponding to the big data features, calculating matrix variance values corresponding to each matrix in the feature matrixes, and determining abnormal matrixes in the feature matrixes according to the matrix variance values;

By performing feature extraction on the standardized big data, the invention obtains the characteristic properties of the data in the standardized big data, so that the data features of the standardized big data can be expressed more intuitively and the subsequent construction of the feature matrix is facilitated. The big data features are the features of each data item in the standardized big data; optionally, the feature extraction can be implemented by a principal component analysis method.

By constructing the feature matrix corresponding to the big data features, the invention converts the big data features from an abstract expression to a numerical expression, which facilitates the subsequent calculation of the matrix variance value. The feature matrix is the aggregate square matrix corresponding to the big data features.

As an embodiment of the present invention, the constructing the feature matrix corresponding to the big data feature includes: performing dimension reduction processing on the big data features to obtain dimension reduction features, performing vector conversion on the dimension reduction features to obtain feature vectors, calculating feature vector values corresponding to the feature vectors, taking the feature vector values as feature values corresponding to the big data features, calculating vector similarity coefficients among the feature vectors, and constructing feature matrixes corresponding to the big data features according to the vector similarity coefficients and the feature values.

The dimension reduction feature is that features in the big data feature are reduced from high dimension to low dimension, the features in the big data feature can be converted into the same dimension so as to facilitate subsequent processing, the feature vector is a vector expression form corresponding to the dimension reduction feature, the feature vector value is a numerical value corresponding to the feature vector, and the vector similarity coefficient represents the vector similarity degree between the feature vectors.

Optionally, the dimension reduction processing on the big data feature may be implemented by a PCA linear dimension reduction method, the vector conversion on the dimension reduction feature may be implemented by a Word2vec algorithm, the calculation of the feature vector value corresponding to the feature vector may be implemented by a vector algorithm, such as addition and subtraction of a vector, and the like, the feature value is divided according to the magnitude of the value of the vector similarity coefficient, and then a feature matrix corresponding to the big data feature is constructed according to a matrix construction function, where the matrix construction function includes a zero function.

Further, as an optional embodiment of the present invention, the calculating a vector similarity coefficient between the feature vectors includes:

According to the method, the matrix variance value corresponding to each matrix in the feature matrix is calculated, and the deviation degree of each matrix in the feature matrix can be known through the matrix variance value, so that the abnormal matrix in the feature matrix can be determined conveniently, wherein the matrix variance value represents the deviation degree between each matrix in the feature matrix.

As one embodiment of the present invention, the calculating a matrix variance value corresponding to each matrix in the feature matrix includes:

Wherein E represents the matrix variance value corresponding to each matrix in the feature matrix, a represents the matrix serial number of the feature matrix, y represents the total number of feature matrices, G_a represents the matrix expected value of the a-th feature matrix, G_a represents the matrix value corresponding to the a-th feature matrix, and the remaining term represents the average value of the feature matrix.

By determining the abnormal matrix in the feature matrix according to the matrix variance value, the invention identifies the matrices that deviate strongly within the feature matrix, which improves the accuracy of the subsequent determination of noise data. The abnormal matrix is a matrix with a high degree of deviation in the feature matrix. Optionally, the matrix variance value is compared with a preset variance value, and if the matrix variance value is larger than the preset variance value, the feature matrix corresponding to that matrix variance value is taken as the abnormal matrix. The preset variance value is the criterion against which the matrix variance value is judged; it can be 0.8, or it can be set according to the actual service scenario.
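The threshold comparison can be sketched as follows. The patent's variance formula is not reproduced in this text, so the sketch substitutes a simple stand-in and labels it as such: each matrix is summarized by the mean of its entries (a hypothetical "matrix value"), and its deviation is the squared difference between that value and the average over all matrices. Only the comparison against the preset variance value, with 0.8 as the example threshold, comes from the text.

```python
from statistics import fmean


def matrix_value(matrix):
    """Hypothetical matrix value: the mean of all entries of the matrix."""
    return fmean([x for row in matrix for x in row])


def find_abnormal_matrices(matrices, preset_variance=0.8):
    """Return indexes of matrices whose deviation exceeds the preset variance value.

    Assumption: deviation = squared difference between a matrix value and the
    average of all matrix values. This is a stand-in, not the patent's formula.
    """
    values = [matrix_value(m) for m in matrices]
    average = fmean(values)
    deviations = [(v - average) ** 2 for v in values]
    abnormal = [i for i, d in enumerate(deviations) if d > preset_variance]
    return abnormal, deviations


# Hypothetical 2x2 feature matrices: four similar ones and one that deviates.
matrices = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[1.2, 0.8], [1.0, 1.0]],
    [[0.9, 1.1], [1.0, 1.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
abnormal, deviations = find_abnormal_matrices(matrices, preset_variance=0.8)
print(abnormal)
```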

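Step S3: performing linear conversion on the standardized big data to obtain a big data linear value, constructing a data scatter diagram corresponding to the standardized big data according to the big data linear value, calculating a distance value between each data point in the data scatter diagram to obtain a total data point distance value, and determining discrete data points in the data scatter diagram according to the total data point distance value;

By linearly converting the standardized big data, the invention converts the standardized big data into numerical form, so that it can be displayed visually through numerical values and understood more intuitively. The big data linear value represents the numerical value corresponding to the standardized big data; optionally, the linear conversion of the standardized big data can be implemented with a linear function.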

According to the method, the data scatter diagram corresponding to the standardized big data is constructed according to the big data linear value, the distribution condition of the standardized big data can be known through the data scatter diagram, and the subsequent discrete data point determination is facilitated, wherein the data scatter diagram is a visual diagram corresponding to the standardized big data.

As one embodiment of the present invention, the constructing a data scatter diagram corresponding to the standardized big data according to the big data linear value includes: the method comprises the steps of obtaining a data sequence of each datum in standardized big data, extracting variable data of each datum in the standardized big data, analyzing variable relations between the self-variable data and dependent variable data, calculating variable values corresponding to the self-variable data and the dependent variable data according to the variable relations and the big data linear values, obtaining a first variable value and a second variable value, and constructing a data scatter diagram corresponding to the standardized big data according to the first variable value, the second variable value and the data sequence.

The data sequence is the sequence number of each data item in the standardized big data; the self-variable data is the independent variable data in the standardized big data; the dependent variable data is the data in the standardized big data that depends on the self-variable data; and the first variable value and the second variable value represent the linear values corresponding to the self-variable data and the dependent variable data, respectively.

Optionally, the data sequence of each data item in the standardized big data may be obtained through an SQL query statement, the variable data of each data item may be extracted through a stepwise regression method, and the variable relationship between the self-variable data and the dependent variable data may be analyzed through a regression analysis method. A variable relationship coefficient may then be determined according to the variable relationship, and the variable values corresponding to the self-variable data and the dependent variable data may be calculated by combining the big data linear value and the variable relationship coefficient. The data scatter diagram corresponding to the standardized big data may be constructed with the EdrawMax tool.

According to the method, the distance value between each data point in the data scatter diagram is calculated to obtain the total distance value of the data points, so that the distance between each data point in the data scatter diagram can be known, and further the degree of dispersion between each data point in the data scatter diagram can be judged, wherein the total distance value of the data points represents the distance between each data point in the data scatter diagram.

As one embodiment of the present invention, the calculating a distance value between each data point in the data scatter diagram to obtain a total distance value of the data points includes:

By determining the discrete data points in the data scatter diagram according to the total data point distance value, the invention identifies the data points that lie far from the others in the data scatter diagram and thereby obtains the discrete data in the standardized big data. The discrete data points are the data points that lie far from the others in the data scatter diagram. Optionally, the total data point distance value is compared with a preset distance value, and if the total data point distance value is larger than the preset distance value, the data point corresponding to that total data point distance value is taken as a discrete data point in the data scatter diagram.
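One way to read the comparison above is sketched below. The distance formula is not reproduced in this text, so the sketch assumes that the total distance of a data point is the sum of its Euclidean distances to every other point in the scatter diagram. The function names, the sample coordinates and the preset distance value of 20.0 are hypothetical.

```python
import math


def total_distances(points):
    """For each point, sum the Euclidean distances to all other points.

    Assumption: this per-point sum stands in for the total data point
    distance value, whose exact formula is not given in the text.
    """
    totals = []
    for i, p in enumerate(points):
        totals.append(sum(math.dist(p, q) for j, q in enumerate(points) if j != i))
    return totals


def find_discrete_points(points, preset_distance):
    """Return the points whose total distance exceeds the preset distance value."""
    totals = total_distances(points)
    return [p for p, t in zip(points, totals) if t > preset_distance]


# Hypothetical scatter diagram: three clustered points and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
discrete = find_discrete_points(points, preset_distance=20.0)
print(discrete)
```

The pairwise sum costs O(q²) for q data points, so for large data sets a sampled or index-based neighbor search would likely be needed in practice.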

Step S4: determining noise data in the standardized big data by combining the abnormal matrix and the data discrete points, extracting a data signal corresponding to the noise data, calculating a signal sparse value corresponding to the data signal, calculating a signal to noise ratio corresponding to the data signal, inquiring a data noise point in the noise data according to the signal to noise ratio and the signal sparse value, extracting a noise characteristic corresponding to the data noise point, constructing a noise optimization scheme corresponding to the data noise point according to the noise characteristic, executing noise optimization processing of the data noise point according to the noise optimization scheme, and obtaining a noise optimization processing result;

According to the invention, the noise data in the standardized big data are determined by combining the abnormal matrix and the discrete data points, so that the noise data in the standardized big data can be extracted completely, which facilitates the subsequent noise point analysis of the noise data and improves the accuracy of noise point mining. The noise data are the data in the standardized big data that are affected by interference.

By extracting the data signal corresponding to the noise data, the invention obtains the electrical-signal representation of the noise data, which facilitates the subsequent calculation of the signal sparse value. The data signal is an electrical signal carrying the information in the noise data. Optionally, the extraction of the data signal corresponding to the noise data can be realized by a signal collector.

According to the invention, the signal sparse value corresponding to the data signal is calculated, so that the degree of sparsity of the data signal is known and the noise part of the data signal can be identified more easily. The signal sparse value represents the degree of sparsity of the data signal.

As one embodiment of the present invention, the calculating a signal sparse value corresponding to the data signal includes: identifying a data time domain signal and a data frequency domain signal in the data signal, performing Fourier transform on the data time domain signal to obtain a transformed data signal, performing signal reconstruction on the data signal according to the transformed data signal and the data frequency domain signal to obtain a target data signal, calculating signal information entropy corresponding to the target data signal, and taking the signal information entropy as the signal sparse value corresponding to the data signal.

The data time domain signal is the part of the data signal that varies with time. The data frequency domain signal is the part of the data signal expressed in the frequency domain. The transformed data signal is the frequency domain signal obtained by Fourier transforming the data time domain signal. The target data signal is the signal reconstructed from the transformed data signal and the data frequency domain signal. The signal information entropy characterizes the probability of occurrence of the signal in each frequency band of the target data signal, from which the degree of separation of the signals can be known and the signal sparsity of the data signal can be judged.

Optionally, identifying the data time domain signal and the data frequency domain signal in the data signal may be implemented by a MATLAB tool, performing Fourier transform on the data time domain signal may be implemented by a Fourier transform algorithm, and performing signal reconstruction on the data signal may be implemented by a signal reconstruction algorithm.
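The transform-then-entropy idea of this embodiment can be illustrated with the following minimal sketch. It is an assumption-laden simplification rather than the procedure of the invention: it skips the signal reconstruction step, treats each frequency bin's share of the total spectral power as the probability of occurrence, uses a naive discrete Fourier transform suitable only for short signals, and the names `dft` and `signal_sparse_value` are hypothetical.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform, O(n^2); for short illustrative signals."""
    n = len(samples)
    return [
        sum(samples[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]

def signal_sparse_value(samples):
    """Shannon entropy (in bits) of the normalized power spectrum.
    Lower entropy means energy is concentrated in few frequency bins."""
    power = [abs(c) ** 2 for c in dft(samples)]
    total = sum(power)
    if total == 0:
        return 0.0  # an all-zero signal has no spectral content
    probs = [p / total for p in power]
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A pure tone concentrates its energy in two bins; an impulse spreads it evenly.
tone = [math.cos(2 * math.pi * k / 8) for k in range(8)]
impulse = [1.0] + [0.0] * 7
print(signal_sparse_value(tone), signal_sparse_value(impulse))
```

Under these assumptions a low value indicates a sparse spectrum and a high value indicates energy spread across many frequency bins, as is typical of broadband noise.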

As an optional embodiment of the present invention, the calculating signal information entropy corresponding to the target data signal includes:

By calculating the signal-to-noise ratio corresponding to the data signal, the invention obtains the ratio of the strength of the useful signal to the strength of the interference signal in the data signal, which facilitates the subsequent query of noise points in the noise data. The signal-to-noise ratio is the ratio of the energy of the useful signal to the energy of the interference signal in the data signal, and can be obtained by calculating that ratio.
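The energy ratio can be sketched as follows. The sketch assumes that the useful component and the interference component of the data signal are already available as separate sample sequences, which the description does not specify; the function names `signal_to_noise_ratio` and `to_decibels` are hypothetical, and the decibel conversion is a common convention added for convenience.

```python
import math

def signal_to_noise_ratio(useful, interference):
    """Ratio of useful-signal energy to interference-signal energy,
    where energy is the sum of squared sample values."""
    signal_energy = sum(s * s for s in useful)
    noise_energy = sum(n * n for n in interference)
    if noise_energy == 0:
        return math.inf  # no interference at all
    return signal_energy / noise_energy

def to_decibels(ratio):
    """Express an energy ratio in decibels."""
    if ratio == 0:
        return -math.inf
    return 10.0 * math.log10(ratio)

useful = [2.0, 2.0, 2.0, 2.0]          # energy 16
interference = [1.0, -1.0, 1.0, -1.0]  # energy 4
snr = signal_to_noise_ratio(useful, interference)
print(snr, to_decibels(snr))
```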

According to the invention, the data noise points in the noise data are queried according to the signal-to-noise ratio and the signal sparse value, so that the exact noise points that produced the noise data can be obtained and a corresponding noise optimization scheme can be formulated. The data noise points are the specific causes of the noise data, such as sound interference or equipment faults. Optionally, the data noise points in the noise data can be queried from a preset noise point table according to the signal-to-noise ratio and the signal sparse value, where the preset noise point table is a table formed by analyzing a large number of historical noise points, signal-to-noise ratios and signal sparse values and the mapping relationships among them.
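A table lookup of this kind might look like the sketch below. The table layout, the numeric ranges and the cause labels are invented purely for illustration; the description does not disclose the contents or structure of the preset noise point table, and `query_noise_point` is a hypothetical name.

```python
# Hypothetical preset noise point table. Each row maps a range of
# signal-to-noise ratios and a range of signal sparse values to a cause.
# All ranges and labels below are illustrative assumptions.
PRESET_NOISE_POINT_TABLE = [
    # (snr_min, snr_max, sparse_min, sparse_max, noise_point)
    (0.0, 5.0, 0.0, 2.0, "equipment fault"),
    (0.0, 5.0, 2.0, 8.0, "sound interference"),
    (5.0, 20.0, 0.0, 8.0, "human operation error"),
]

def query_noise_point(snr, sparse_value, table=PRESET_NOISE_POINT_TABLE):
    """Return the first cause whose ranges contain both values, or None."""
    for snr_min, snr_max, sp_min, sp_max, noise_point in table:
        if snr_min <= snr < snr_max and sp_min <= sparse_value < sp_max:
            return noise_point
    return None  # no historical mapping matches

print(query_noise_point(3.0, 4.0))
```

Returning `None` for unmatched values keeps the sketch explicit about inputs that fall outside the historical mappings.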

According to the invention, the noise characteristics corresponding to the data noise points are extracted, and a noise optimization scheme corresponding to the data noise points is constructed according to the noise characteristics, so as to remove the data noise points and improve the quality of the standardized big data. The noise optimization scheme is a method for removing the data noise points. Optionally, the noise characteristics corresponding to the data noise points are extracted by a power spectrum method.

As an embodiment of the present invention, the constructing a noise optimization scheme corresponding to the data noise point according to the noise characteristic includes: querying the characteristic indexes corresponding to the noise characteristics, extracting index parameters corresponding to the characteristic indexes, calculating parameter differences between the index parameters and standard index parameters, determining indexes to be optimized among the characteristic indexes according to the parameter differences, querying the optimization rule of each index among the indexes to be optimized, and constructing the noise optimization scheme corresponding to the data noise points according to the optimization rules.

The characteristic indexes are the characteristic items in the noise characteristics. The index parameters are the data of each index among the characteristic indexes. The parameter difference is the difference or gap between an index parameter and the corresponding standard index parameter. The indexes to be optimized are the characteristic indexes that need optimization. The optimization rule is the optimization method for each index to be optimized. Optionally, the characteristic indexes corresponding to the noise characteristics are queried through a Match function, the index parameters corresponding to the characteristic indexes are extracted through a left function, the optimization rule of each index to be optimized is queried from the Internet through man-machine interaction, and the noise optimization scheme corresponding to the data noise points is obtained by combining the optimization rules.
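The comparison against standard index parameters and the assembly of a scheme from optimization rules can be sketched as follows. The index names, parameter values, tolerance and rule texts are hypothetical, and the use of a single absolute tolerance for all indexes is a simplifying assumption; the description does not state how the parameter difference is evaluated.

```python
def indexes_to_optimize(index_params, standard_params, tolerance):
    """Return {index name: parameter difference} for every index whose
    parameter deviates from its standard parameter by more than tolerance."""
    diffs = {}
    for name, value in index_params.items():
        diff = value - standard_params[name]
        if abs(diff) > tolerance:
            diffs[name] = diff
    return diffs

def build_scheme(to_optimize, rules):
    """Combine the optimization rules of the indexes to be optimized."""
    return [rules[name] for name in to_optimize if name in rules]

# Hypothetical index parameters extracted from a noise characteristic.
index_params = {"peak_frequency_hz": 50.0, "bandwidth_hz": 12.0}
standard_params = {"peak_frequency_hz": 50.0, "bandwidth_hz": 5.0}
rules = {"bandwidth_hz": "apply a band-pass filter"}

to_optimize = indexes_to_optimize(index_params, standard_params, tolerance=1.0)
scheme = build_scheme(to_optimize, rules)
print(to_optimize, scheme)
```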

According to the invention, the noise optimization processing of the data noise points is executed according to the noise optimization scheme, so as to improve the data quality of the standardized big data. The noise optimization processing result is the result obtained after the data noise points have been subjected to the optimization processing.

The system 100 for mining collected noise points based on big data analysis can be installed in an electronic device. Depending on the functions implemented, the system 100 may include a data processing module 101, a matrix variance calculation module 102, a discrete point determination module 103, and a noise point optimization module 104. A module of the invention, which may also be referred to as a unit, is a series of computer program segments that are stored in the memory of the electronic device, can be executed by the processor of the electronic device, and perform a fixed function.

In the present embodiment, the functions concerning the respective modules/units are as follows:

The data processing module 101 is configured to obtain raw big data to be mined and a data field, perform data cleaning on the raw big data to obtain cleaned big data, and perform data standardization processing on the cleaned big data according to the data field to obtain standardized big data;

The matrix variance calculating module 102 is configured to perform feature extraction on the standardized big data to obtain big data features, construct feature matrices corresponding to the big data features, calculate matrix variance values corresponding to each matrix in the feature matrices, and determine abnormal matrices in the feature matrices according to the matrix variance values, where the calculating the matrix variance values corresponding to each matrix in the feature matrices includes:

The discrete point determining module 103 is configured to perform linear conversion on the standardized big data to obtain a big data linear value, construct a data scatter diagram corresponding to the standardized big data according to the big data linear value, calculate a distance value between each data point in the data scatter diagram to obtain a total data point distance value, and determine a discrete data point in the data scatter diagram according to the total data point distance value;

The noise point optimization module 104 is configured to combine the abnormal matrix and the discrete data points, determine noise data in the standardized big data, extract a data signal corresponding to the noise data, calculate a signal sparse value corresponding to the data signal, calculate a signal-to-noise ratio corresponding to the data signal, query a data noise point in the noise data according to the signal-to-noise ratio and the signal sparse value, extract a noise characteristic corresponding to the data noise point, construct a noise optimization scheme corresponding to the data noise point according to the noise characteristic, and execute noise optimization processing of the data noise point according to the noise optimization scheme to obtain a noise optimization processing result.

In detail, each module of the system 100 for mining collected noise points based on big data analysis in the embodiment of the present application adopts the same technical means as the method for mining collected noise points based on big data analysis described in FIG. 1 and can produce the same technical effects, which are not repeated here.

It should be understood that the described embodiments are for illustrative purposes only and do not limit the scope of the patent application to this configuration.

The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.

The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for mining collected noise points based on big data analysis, characterized by comprising the following steps:

Wherein the symbols of the formula represent, respectively: the matrix variance value corresponding to each matrix in the feature matrix; a, the matrix serial number of the feature matrix; y, the total number of feature matrices; the matrix expectation of the a-th feature matrix; the matrix value corresponding to the a-th feature matrix; and the average value of the feature matrix;

The constructing the feature matrix corresponding to the big data feature comprises the following steps:

Calculating vector similarity coefficients among the feature vectors, and constructing a feature matrix corresponding to the big data features according to the vector similarity coefficients and the feature values;

Wherein said calculating vector similarity coefficients between said feature vectors comprises:

Wherein the symbols of the formula represent, respectively: the vector similarity coefficient between the feature vectors; b, the sequence number of a feature vector; the number of feature vectors; the vector length corresponding to the b-th feature vector; the average length of all the feature vectors; and the vector length corresponding to the (b+1)-th feature vector;

The calculating the distance value between each data point in the data scatter diagram to obtain the total distance value of the data points comprises the following steps:

Wherein the symbols of the formula represent, respectively: the total distance value between the data points in the data scatter diagram; i, the sequence number of a data point in the data scatter diagram; the number of data points in the data scatter diagram; the coordinate value of the (i-1)-th data point in the data scatter diagram; the coordinate value of the i-th data point in the data scatter diagram; and the coordinate value of the q-th data point in the data scatter diagram;

Step S4: determining noise data in the standardized big data by combining the abnormal matrix and the discrete data points, extracting a data signal corresponding to the noise data, calculating a signal sparse value corresponding to the data signal, calculating a signal-to-noise ratio corresponding to the data signal, querying a data noise point in the noise data according to the signal-to-noise ratio and the signal sparse value, extracting a noise characteristic corresponding to the data noise point, constructing a noise optimization scheme corresponding to the data noise point according to the noise characteristic, and executing noise optimization processing of the data noise point according to the noise optimization scheme to obtain a noise optimization processing result;

The calculating the signal sparse value corresponding to the data signal includes:

Calculating signal information entropy corresponding to the target data signal, and taking the signal information entropy as a signal sparse value corresponding to the data signal;

The calculating the signal information entropy corresponding to the target data signal includes:

Wherein the symbols of the formula represent, respectively: the signal information entropy corresponding to the target data signal; j, the sequence number of the time period corresponding to the target data signal; t, the number of packets in the target data signal; the data signal value of the j-th time period in the target data signal; and the probability of occurrence of the data signal value of the j-th time period in the target data signal.

2. The method for mining collected noise points based on big data analysis according to claim 1, wherein the performing data standardization processing on the cleaned big data according to the data field to obtain standardized big data comprises:

3. The method for mining collected noise points based on big data analysis according to claim 1, wherein the constructing the data scatter diagram corresponding to the standardized big data according to the big data linear value includes:

4. The method for mining collected noise points based on big data analysis according to claim 1, wherein the constructing a noise optimization scheme corresponding to the data noise points according to the noise characteristics comprises:

5. A big data analysis based acquisition noise point mining system for performing the big data analysis based acquisition noise point mining method of any of claims 1-4, the system comprising:

The noise point optimization module is configured to combine the abnormal matrix and the discrete data points, determine noise data in the standardized big data, extract a data signal corresponding to the noise data, calculate a signal sparse value corresponding to the data signal, calculate a signal-to-noise ratio corresponding to the data signal, query a data noise point in the noise data according to the signal-to-noise ratio and the signal sparse value, extract a noise characteristic corresponding to the data noise point, construct a noise optimization scheme corresponding to the data noise point according to the noise characteristic, and execute noise optimization processing of the data noise point according to the noise optimization scheme to obtain a noise optimization processing result;