Disclosure of Invention
The present invention has been made in view of the above-described problems.
Accordingly, the technical problems solved by the invention are that existing data management methods have non-uniform standards, inspect data quality inefficiently, cannot adapt dynamically to the data environment, and cannot realize a complete closed loop of data management.
To solve these technical problems, the invention provides the following technical scheme: an industrial data dynamic arrangement quality detection method, comprising: formulating industrial data standards, performing standard-conformance inspection of metadata, and collecting the data of each city into an aggregation library according to the different data types and integration modes; configuring a data quality model and quality inspection rules, automatically generating a mapping relation between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality; and splitting the audited data, extracting correct data to a standard library, marking data labels to form data assets, and returning data to the data source to form a complete closed loop of data management.
As a preferred scheme of the industrial data dynamic arrangement quality detection method of the invention, formulating the industrial data standards comprises: establishing, according to business classification, four classes of data standards (technical, business, management and quality) together with a word root library, and establishing the mapping relation between the data standards and the metadata through an intelligent mapping function. When one standard is recommended for different metadata, matching is performed if two or more metadata items with the same Chinese name appear. The matching rule is to match the metadata items having the same Chinese name; among the matched items, the English name of the first metadata item is selected as the unified English name. The exclusion condition is that no match is made when the English names are identical but contain no word root. When matching by Chinese word roots and synonyms, the metadata are preliminarily screened using the Chinese word roots and their synonyms; when the number of matches is greater than or equal to 2, the selected metadata are considered matched, and the English name of the matched metadata is formed directly by combining the English forms of the word roots, separated by underscores. When matching by English word roots and abbreviations, the metadata are preliminarily screened using the English names and abbreviations of the word roots; when the number of matches is greater than or equal to 2, the selected metadata are considered matched, and otherwise they are not matched.
As a preferred scheme of the industrial data dynamic arrangement quality detection method of the invention, formulating the industrial data standards further comprises analyzing the metadata ranked in the top 10 by comprehensive criticality and matching the metadata with the highest criticality, the criticality being expressed as:
[formula not reproduced in the source text]
where the terms denote, respectively, the criticality evaluation function, the matching degree calculation function, the matched word root number calculation function and the association degree calculation function. The matching degree calculation function measures the degree of similarity between two data objects and is expressed as:
[formula not reproduced in the source text]
where the quantities defined include the number of Chinese matched word root characters, the number of Chinese characters and the number of English matched word root characters. After the data standards are formulated, standard-conformance inspection of the metadata is performed: the metadata are evaluated on the basis of the number of matched word roots and the association degree in order to verify data quality. The matched word root number calculation function is expressed as:
[formula not reproduced in the source text]
where the quantities are, respectively, the minimum and maximum numbers of Chinese matches and the minimum and maximum numbers of English matches. The association degree calculation function is expressed as:
[formula not reproduced in the source text]
where the quantities are, respectively, the number of associations and the minimum and maximum numbers of associations.
As a preferred scheme of the industrial data dynamic arrangement quality detection method of the invention, the data quality audit comprises: establishing a data quality model and a data quality rule base according to the data characteristics, and automatically generating quality inspection rules from the data standards, the quality inspection rules comprising null value inspection, value range inspection, standard inspection, repeatability inspection, timeliness inspection, reference value inspection and logic inspection. The quality inspection scheme comprises freely combining the quality inspection rules and setting the data period, the data labels and the run strategy for execution of the scheme. The data to be inspected are input into classifier models trained in advance, and each classifier model classifies the input data; the accuracy and recall of the different classifiers on the data set are evaluated by cross-validation, and the mean of the accuracy and recall of the different classifiers is taken as the health state evaluation score of the training data.
As a preferred scheme of the industrial data dynamic arrangement quality detection method of the invention, the data quality audit further comprises constructing a data set containing noise labels, expressed as:
[formula not reproduced in the source text]
where the symbols denote, respectively, the industrial data features of the i-th sample, the noise label of the i-th sample, and the total number of samples. The noisy data are predicted to obtain the noise prediction probability of each sample, expressed as:
[formula not reproduced in the source text]
where the left-hand side denotes the predicted probability, under the given model, of the noise label of the i-th sample, and the right-hand side is the conditional probability of the noise label given the sample under the model. Based on the predicted probabilities, the confidence joint of the noise label and the true label is calculated, expressed as:
[formula not reproduced in the source text]
where the terms denote, respectively, the confidence joint matrix, which represents the joint confidence of the noise label and the true label; the Dirac delta function, which takes the value 1 when its condition holds and 0 otherwise; and the i-th true label.
The label combinations are counted and normalized to obtain a joint probability distribution, expressed as:
[formula not reproduced in the source text]
where the terms denote, respectively, the normalized joint probability matrix and the sum of the confidences over all true labels in the confidence joint matrix, which is used for normalization.
Using the pruning function of Cleanlab, data pruning is performed with the confidence joint matrix and the normalized count data to remove noisy data and obtain clean data. The confidence of each sample is calculated and the data are pruned according to this confidence, expressed as:
[formula not reproduced in the source text]
where the terms denote, respectively, the confidence of the i-th sample and the sum of the predicted probabilities of all samples under the noise label. A pruning threshold is set, and samples whose confidence is below the threshold are removed to obtain the clean data, expressed as:
[formula not reproduced in the source text]
where the terms denote, respectively, the clean data set and the set pruning threshold.
As a preferred scheme of the industrial data dynamic arrangement quality detection method of the invention, forming the data assets comprises: splitting the audited data; archiving abnormal data to an error data table, triggering a data quality correction flow and generating a data quality correction work order; and extracting correct data to the standard library and marking data labels to form the data assets.
As a preferred scheme of the industrial data dynamic arrangement quality detection method of the invention, forming the complete closed loop of data management comprises: creating an error data archive and a quality inspection report; analyzing the quality problems in the report and separating the various classes of data from the problem data set; forming separate problem data work orders according to data source and pushing them to the data source systems; tracking the repair of the source data quality problems; and, after the repair is complete, having the data re-enter the aggregation library from the front-end processor through data aggregation, so that data quality management is carried out repeatedly.
Another object of the invention is to provide an industrial data dynamic arrangement quality detection system which, by configuring a data quality model and quality inspection rules, automatically generating a mapping relation between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality, solves the problem of low accuracy in existing industrial data quality inspection.
The industrial data dynamic arrangement quality detection system comprises a data standardization module, a quality auditing module and a data distribution module. The data standardization module is used for formulating industrial data standards, performing standard-conformance inspection of metadata, and collecting the data of each city into an aggregation library according to the different data types and integration modes. The quality auditing module is used for configuring a data quality model and quality inspection rules, automatically generating a mapping relation between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality. The data distribution module is used for splitting the audited data, extracting correct data to a standard library, marking data labels to form data assets, and returning data to the data source to form a complete closed loop of data management.
A computer device comprises a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the industrial data dynamic arrangement quality detection method.
A computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the industrial data dynamic arrangement quality detection method.
The beneficial effects of the invention are as follows. By formulating industrial data standards, performing standard-conformance inspection of metadata, and collecting the data of each city into the aggregation library according to the different data types and integration modes, the method ensures the uniformity and consistency of the data standards, integrates and aggregates data effectively into a complete data resource library, improves the efficiency and accuracy of data collection, ensures that data undergo preliminary standardized processing before entering the aggregation library, and reduces the format and type inconsistencies that can arise during data integration. By configuring and executing a quality inspection scheme to audit data quality, the method inspects and evaluates data quality automatically, keeps data quality consistent across processing stages, inspects the data comprehensively, finds and marks quality problems, improves the efficiency and accuracy of data quality inspection, and reduces the workload and errors of manual inspection; because the quality inspection rules and the mapping relation are generated automatically, the quality inspection scheme can be configured and executed quickly, which raises the level of automation of data management. By splitting the audited data, extracting correct data to the standard library and marking data labels to form data assets, the method forms a complete closed loop of data management, enables refined management, efficient use, and quick retrieval and classification of the data, improves data utilization and management efficiency, and ensures data quality and consistency throughout the data management process. By forming closed-loop management through data backflow, the method establishes a continuously improving data management mechanism and ensures the long-term stability and optimization of data quality. The invention therefore achieves better results in accuracy, efficiency and reliability.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Embodiment 1: Referring to Fig. 1, an embodiment of the invention provides an industrial data dynamic arrangement quality detection method, comprising:
S1: Formulating industrial data standards, performing standard-conformance inspection of metadata, and collecting the data of each city into an aggregation library according to the different data types and integration modes.
Further, formulating the industrial data standards comprises establishing, according to business classification, four classes of data standards (technical, business, management and quality) together with a word root library, and establishing the mapping relation between the data standards and the metadata through an intelligent mapping function. When one standard is recommended for different metadata, matching is performed if two or more metadata items with the same Chinese name appear. The matching rule is to match the metadata items having the same Chinese name; among the matched items, the English name of the first metadata item is selected as the unified English name. The exclusion condition is that no match is made when the English names are identical but contain no word root. When matching by Chinese word roots and synonyms, the metadata are preliminarily screened using the Chinese word roots and their synonyms; when the number of matches is greater than or equal to 2, the selected metadata are considered matched, and the English name of the matched metadata is formed directly by combining the English forms of the word roots, separated by underscores. When matching by English word roots and abbreviations, the metadata are preliminarily screened using the English names and abbreviations of the word roots; when the number of matches is greater than or equal to 2, the selected metadata are considered matched, and otherwise they are not matched.
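The matching rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the `ROOTS` dictionary, the data layout and the function names `match_by_chinese_roots` and `unify_english_name` are hypothetical, synonyms and the exclusion condition are omitted, and Chinese word roots are located by simple substring search.

```python
# Hypothetical word root library: Chinese root -> English root (assumed sample data).
ROOTS = {"设备": "device", "编号": "id", "温度": "temperature", "时间": "time"}

MIN_MATCHES = 2  # the text treats metadata as matched when the match count is >= 2


def match_by_chinese_roots(chinese_name, roots=ROOTS):
    """Return the English name built from matched roots, or None if fewer than 2 match."""
    # Record each root's position so the English parts keep the Chinese word order.
    hits = sorted(
        (chinese_name.find(cn), en) for cn, en in roots.items() if cn in chinese_name
    )
    if len(hits) < MIN_MATCHES:
        return None
    # Combine the English roots directly, separated by underscores.
    return "_".join(en for _, en in hits)


def unify_english_name(items):
    """For metadata sharing a Chinese name, adopt the first item's English name.

    `items` is a list of (chinese_name, english_name) pairs. Returns a dict that
    maps each Chinese name occurring two or more times to its unified English name.
    """
    groups = {}
    for cn, en in items:
        groups.setdefault(cn, []).append(en)
    return {cn: names[0] for cn, names in groups.items() if len(names) >= 2}
```

A name containing only one known root, such as "温度", is left unmatched by this sketch, which mirrors the threshold of two matches stated in the text.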
It should be noted that formulating the industrial data standards further comprises analyzing the metadata ranked in the top 10 by comprehensive criticality and matching the metadata with the highest criticality, the criticality being expressed as:
[formula not reproduced in the source text]
where the terms denote, respectively, the criticality evaluation function, the matching degree calculation function, the matched word root number calculation function and the association degree calculation function. The matching degree calculation function measures the degree of similarity between two data objects and is expressed as:
[formula not reproduced in the source text]
where the quantities defined include the number of Chinese matched word root characters, the number of Chinese characters and the number of English matched word root characters. After the data standards are formulated, standard-conformance inspection of the metadata is performed: the metadata are evaluated on the basis of the number of matched word roots and the association degree in order to verify data quality. The matched word root number calculation function is expressed as:
[formula not reproduced in the source text]
where the quantities are, respectively, the minimum and maximum numbers of Chinese matches and the minimum and maximum numbers of English matches. The association degree calculation function is expressed as:
[formula not reproduced in the source text]
where the quantities are, respectively, the number of associations and the minimum and maximum numbers of associations.
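Since the text gives only the definitions of the terms, the following Python sketch illustrates one plausible reading of them rather than the formulas of the invention. It assumes that the association degree is a min-max normalization of the association count and that the criticality is the unweighted mean of three component scores in [0, 1]; both choices, and the names `min_max`, `criticality` and `top_k_by_criticality`, are assumptions. Only the top-10 selection is taken directly from the text.

```python
def min_max(value, lo, hi):
    """Min-max normalize `value` into [0, 1]; returns 0.0 when the range is empty."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def criticality(match_degree, root_score, assoc_score):
    """Assumed combination: unweighted mean of the three component scores."""
    return (match_degree + root_score + assoc_score) / 3.0


def top_k_by_criticality(scored, k=10):
    """Return the k metadata names with the highest criticality, highest first.

    `scored` maps a metadata name to its criticality value.
    """
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

Any weighting of the three components could be substituted in `criticality` without changing the ranking step.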
It should also be noted that formulating uniform data standards solves the problem of inconsistent data standards and ensures the standardization and consistency of the data. In the prior art, data standards are not uniform, data integration is difficult, and data quality is uneven. Formulating industrial data standards and performing standard-conformance inspection of metadata effectively solves this problem: it improves the efficiency of data collection and integration and provides a solid foundation for subsequent data processing and analysis. The industrial data standards ensure the uniformity and compatibility of the various data across different cities and different systems, improve the efficiency of data integration, and reduce the complexity of data format conversion. According to the purpose for which the data are used, the requirements of classified and graded database construction are met and resources are planned in a unified manner; the data resources are organized and mined under unified standards and standardized processes; and urban infrastructure data, monitoring data and supervision data are integrated. A platform-based modeling tool is used to build, for the urban security operation center, a data system comprising an aggregation library, a theme library and a special-topic library, which provides data support for decision-making departments and makes the sharing and use of data more efficient.
S2: Configuring a data quality model and quality inspection rules, automatically generating a mapping relation between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality.
Further, the data quality audit comprises establishing a data quality model and a data quality rule base according to the data characteristics, and automatically generating quality inspection rules from the data standards, the quality inspection rules comprising null value inspection, value range inspection, standard inspection, repeatability inspection, timeliness inspection, reference value inspection and logic inspection. The quality inspection scheme comprises freely combining the quality inspection rules and setting the data period, the data labels and the run strategy for execution of the scheme. The data to be inspected are input into classifier models trained in advance, and each classifier model classifies the input data; the accuracy and recall of the different classifiers on the data set are evaluated by cross-validation, and the mean of the accuracy and recall of the different classifiers is taken as the health state evaluation score of the training data.
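A quality inspection scheme built by freely combining rules might look like the sketch below. The rule functions, the record layout and the `run_scheme` helper are hypothetical; only three of the seven rule types (null value, value range and repeatability) are shown, and the data period and run strategy are left out.

```python
def null_check(field):
    """Rule: the field must be present and neither None nor an empty string."""
    return lambda rec, seen: rec.get(field) not in (None, "")


def range_check(field, lo, hi):
    """Rule: a numeric field must lie within [lo, hi]."""
    def rule(rec, seen):
        value = rec.get(field)
        return isinstance(value, (int, float)) and lo <= value <= hi
    return rule


def duplicate_check(field):
    """Rule: the field value must not have appeared in an earlier record."""
    def rule(rec, seen):
        key = (field, rec.get(field))
        if key in seen:
            return False
        seen.add(key)
        return True
    return rule


def run_scheme(records, rules):
    """Apply every named rule to every record and return (passed, failed).

    Each failed entry pairs the record with the names of the rules it broke.
    """
    seen, passed, failed = set(), [], []
    for rec in records:
        broken = [name for name, rule in rules.items() if not rule(rec, seen)]
        if broken:
            failed.append((rec, broken))
        else:
            passed.append(rec)
    return passed, failed
```

A scheme is then just a dictionary of named rules, for example `{"null_temp": null_check("temp"), "dup_id": duplicate_check("id")}`, so rules can be added or removed without changing `run_scheme`.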
It should be noted that the data quality audit further comprises constructing a data set containing noise labels, expressed as:
[formula not reproduced in the source text]
where the symbols denote, respectively, the industrial data features of the i-th sample, the noise label of the i-th sample, and the total number of samples. The noisy data are predicted to obtain the noise prediction probability of each sample, expressed as:
[formula not reproduced in the source text]
where the left-hand side denotes the predicted probability, under the given model, of the noise label of the i-th sample, and the right-hand side is the conditional probability of the noise label given the sample under the model. Based on the predicted probabilities, the confidence joint of the noise label and the true label is calculated, expressed as:
[formula not reproduced in the source text]
where the terms denote, respectively, the confidence joint matrix, which represents the joint confidence of the noise label and the true label; the Dirac delta function, which takes the value 1 when its condition holds and 0 otherwise; and the i-th true label.
The label combinations are counted and normalized to obtain a joint probability distribution, expressed as:
[formula not reproduced in the source text]
where the terms denote, respectively, the normalized joint probability matrix and the sum of the confidences over all true labels in the confidence joint matrix, which is used for normalization.
Using the pruning function of Cleanlab, data pruning is performed with the confidence joint matrix and the normalized count data to remove noisy data and obtain clean data. The confidence of each sample is calculated and the data are pruned according to this confidence, expressed as:
[formula not reproduced in the source text]
where the terms denote, respectively, the confidence of the i-th sample and the sum of the predicted probabilities of all samples under the noise label. A pruning threshold is set, and samples whose confidence is below the threshold are removed to obtain the clean data, expressed as:
[formula not reproduced in the source text]
where the terms denote, respectively, the clean data set and the set pruning threshold.
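The thresholding step can be illustrated without the Cleanlab library. The sketch below assumes that a sample's confidence is simply the model's predicted probability for its given (possibly noisy) label and that samples below a fixed threshold are dropped. This is a simplification of the per-sample confidence described above and is not Cleanlab's API; `prune_by_confidence` and its arguments are hypothetical names.

```python
def prune_by_confidence(samples, labels, pred_probs, threshold):
    """Keep samples whose predicted probability for their given label is >= threshold.

    samples:    list of feature records
    labels:     list of integer (possibly noisy) labels, one per sample
    pred_probs: list of per-class probability lists, one per sample
    Returns (clean, removed) as lists of (sample, label, confidence) tuples.
    """
    clean, removed = [], []
    for x, y, probs in zip(samples, labels, pred_probs):
        confidence = probs[y]  # assumed definition of the per-sample confidence
        target = clean if confidence >= threshold else removed
        target.append((x, y, confidence))
    return clean, removed
```

Returning the removed samples alongside the clean ones keeps them available for the error data table used in the later splitting step.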
In the prior art, data quality inspection relies mainly on manual work, which is inefficient and error-prone. By automatically generating the mapping relation between the data standards and the data model and by configuring and executing the quality inspection scheme, the efficiency and accuracy of data quality inspection are greatly improved, manual intervention is reduced, and the consistency and reliability of data quality are ensured. By executing automated quality inspection rules, data quality is monitored and fed back in real time, so that data problems are found and resolved promptly.
S3: Splitting the audited data, extracting correct data to a standard library, marking data labels to form data assets, and returning data to the data source to form a complete closed loop of data management.
Further, forming the data assets comprises splitting the audited data; archiving abnormal data to an error data table, triggering a data quality correction flow and generating a data quality correction work order; and extracting correct data to the standard library and marking data labels to form the data assets.
It should be noted that forming the complete closed loop of data management comprises creating an error data archive and a quality inspection report; analyzing the quality problems in the report and separating the various classes of data from the problem data set; forming separate problem data work orders according to data source and pushing them to the data source systems; tracking the repair of the source data quality problems; and, after the repair is complete, having the data re-enter the aggregation library from the front-end processor through data aggregation, so that data quality management is carried out repeatedly.
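The splitting and backflow steps can be sketched as follows. The record fields (`source`, `errors`) and the function names are hypothetical, and pushing work orders to the source systems and tracking their repair are represented only by the returned data structures.

```python
from collections import defaultdict


def split_audited(records):
    """Route audited records: no errors -> standard library, otherwise -> error table."""
    standard_library, error_table = [], []
    for rec in records:
        (error_table if rec.get("errors") else standard_library).append(rec)
    return standard_library, error_table


def build_work_orders(error_table):
    """Group problem records by data source, one correction work order per source."""
    by_source = defaultdict(list)
    for rec in error_table:
        by_source[rec["source"]].append(rec)
    return [
        {"source": src, "status": "open", "records": recs}
        for src, recs in sorted(by_source.items())
    ]
```

Once a source system reports a work order as repaired, the corrected records would pass through aggregation and auditing again, which is what closes the loop.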
It should further be noted that data splitting and data backflow realize the complete closed loop of data management and ensure the continuous improvement and optimization of data quality. In the prior art, data management does not achieve continuous improvement of data quality. By splitting the audited data, extracting correct data to the standard library, marking data labels to form data assets, and returning data to the data source, this problem is effectively solved: the efficiency and effect of data management are improved, a mechanism for the continuous improvement of data quality is formed, the utilization value and reliability of the data are raised through data assetization and labeling, and high-quality data support is provided for an enterprise's data-driven decisions.
Embodiment 2: This embodiment of the invention provides an industrial data dynamic arrangement quality detection method; to verify the beneficial effects of the invention, it is demonstrated scientifically through economic benefit calculation and simulation experiments.
First, four classes of data standards (technical, business, management and quality) and a word root library are established according to business classification, and the mapping relation between the data standards and the metadata is established through an intelligent mapping function. When one standard is recommended for different metadata, matching is performed if two or more metadata items with the same Chinese name appear. The matching rule is to match the metadata items having the same Chinese name; among the matched items, the English name of the first metadata item is selected as the unified English name. The exclusion condition is that no match is made when the English names are identical but contain no word root. When matching by Chinese word roots and synonyms, the metadata are preliminarily screened using the Chinese word roots and their synonyms; when the number of matches is greater than or equal to 2, the selected metadata are considered matched, and the English name of the matched metadata is formed directly by combining the English forms of the word roots, separated by underscores. When matching by English word roots and abbreviations, the metadata are preliminarily screened using the English names and abbreviations of the word roots; when the number of matches is greater than or equal to 2, the selected metadata are considered matched, and otherwise they are not matched.
Next, a data quality model and a data quality rule base are established according to the data characteristics, and quality inspection rules are generated automatically from the data standards; the quality inspection rules comprise null value inspection, value range inspection, standard inspection, repeatability inspection, timeliness inspection, reference value inspection and logic inspection. In the quality inspection scheme, the quality inspection rules are freely combined, and the data period, the data labels and the run strategy for execution of the scheme are set. The data to be inspected are input into classifier models trained in advance, each classifier model classifies the input data, the accuracy and recall of the different classifiers on the data set are evaluated by cross-validation, and the mean of the accuracy and recall of the different classifiers is taken as the health state evaluation score of the training data.
Finally, the audited data are split: abnormal data are archived to an error data table, a data quality correction flow is triggered and a data quality correction work order is generated, while correct data are extracted to the standard library and marked with data labels to form data assets, and data are returned to the data source to form a complete closed loop. Forming the complete closed loop of data management comprises creating an error data archive and a quality inspection report, analyzing the quality problems in the report, separating the various classes of data from the problem data set, forming separate problem data work orders according to data source and pushing them to the data source systems, and tracking the repair of the source data quality problems; after the repair is complete, the data re-enter the aggregation library from the front-end processor through data aggregation, and data quality management is carried out repeatedly.
Table 1. Experimental data records
[table not reproduced in the source text]
The table records the standard conformance rate of the data of each city. The data show that, with industrial data standards formulated and standard-conformance inspection of metadata performed, the standard conformance rate of every city's data exceeds 85%, with the highest reaching 98%; this indicates that the method raises the degree of data standardization and ensures the consistency of the data. The health state evaluation score is computed from the accuracy and recall of the different classifiers; the data show that the health state evaluation score of every city's data exceeds 0.82, with the highest reaching 0.95. This indicates that configuring the data quality model and quality inspection rules, automatically generating the mapping relation between the data standards and the data model, and configuring and executing the quality inspection scheme effectively improve the accuracy and efficiency of data quality inspection. The data of all test objects were standardized and formed into data assets, which shows that splitting the audited data, extracting correct data to the standard library and marking data labels realizes closed-loop data management and ensures the continuous improvement of data quality.
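The health state evaluation score discussed above (the mean of the accuracy and recall of the different classifiers) can be computed as in the sketch below. It assumes binary labels with 1 as the positive class and that the predictions of each classifier have already been collected from the cross-validation folds; the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives predicted as positive (0.0 if there are none)."""
    positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not positives:
        return 0.0
    return sum(p == positive for p in positives) / len(positives)


def health_score(y_true, predictions_by_classifier):
    """Mean of accuracy and recall taken over all classifiers."""
    values = []
    for y_pred in predictions_by_classifier:
        values.append(accuracy(y_true, y_pred))
        values.append(recall(y_true, y_pred))
    return sum(values) / len(values)
```

For example, one classifier with accuracy 1.0 and recall 1.0 and another with accuracy 0.75 and recall 0.5 give a score of (1.0 + 1.0 + 0.75 + 0.5) / 4 = 0.8125.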
Embodiment 3: Referring to Fig. 2, an embodiment of the invention provides an industrial data dynamic arrangement quality detection system, which comprises a data standardization module, a quality auditing module and a data distribution module.
The data standardization module is used for formulating industrial data standards, performing standard-conformance inspection of metadata, and collecting the data of each city into an aggregation library according to the different data types and integration modes. The quality auditing module is used for configuring a data quality model and quality inspection rules, automatically generating a mapping relation between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality. The data distribution module is used for splitting the audited data, extracting correct data to a standard library, marking data labels to form data assets, and returning data to the data source to form a complete closed loop of data management.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium include an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, the steps or methods may be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having appropriate combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
It should be noted that the above embodiments are intended only to illustrate, and not to limit, the technical solution of the invention. Although the invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications and substitutions fall within the scope of the claims of the invention.