CN118838895B

CN118838895B - Industrial data dynamic arrangement quality detection method and system

Info

Publication number: CN118838895B
Application number: CN202411313595.6A
Authority: CN
Inventors: 袁存发; 汤幸福; 毛旭初; 陆文迪; 胡迪; 郑豹; 汤世康; 李重阳
Original assignee: Luculent Smart Technologies Co ltd
Current assignee: Luculent Smart Technologies Co ltd
Priority date: 2024-09-20
Filing date: 2024-09-20
Publication date: 2025-01-14
Anticipated expiration: 2044-09-20
Also published as: CN118838895A

Abstract

The present invention discloses a method and system for dynamic arrangement quality detection of industrial data, relating to the technical field of data governance and quality detection. The method includes: formulating industrial data standards, checking metadata for conformance with those standards, and collecting data from each city into an aggregation library according to data type and integration method; configuring a data quality model and quality inspection rules, automatically generating the mapping between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality; and diverting the audited data, extracting correct data into a standard library, tagging it with data labels to form data assets, and returning data to its source to form a complete closed loop of data governance. By auditing data quality automatically, the method improves the efficiency and accuracy of quality inspection and reduces the workload and errors of manual inspection. By returning data to its source, it forms closed-loop management and a continuously improving governance mechanism that keeps data quality stable and optimized over the long term.

Description

Industrial data dynamic arrangement quality detection method and system

Technical Field

The invention relates to the technical field of data management and quality detection, in particular to a method and a system for detecting dynamic arrangement quality of industrial data.

Background

With the rapid development of the industrial internet, technologies for collecting and processing industrial data have advanced significantly. Industrial data spans multiple departments, systems, and industries, including manufacturing, energy, and transportation. It is generated by sensors, monitoring systems, and other data collection devices, and covers equipment status, production processes, environmental monitoring, and more. Data standardization and governance technologies were developed to manage and use this volume of data effectively. Existing data governance relies mainly on architectures such as data warehouses, data lakes, and data middle platforms: unified data standards and specifications are established, and data is cleaned, integrated, and analyzed to support enterprise operational decisions and intelligent manufacturing. However, as data volume keeps growing and data structures become more complex, traditional data governance faces major challenges and needs more efficient and intelligent solutions.

Although existing data governance technology is effective to a degree in processing industrial data, it still has a number of shortcomings. Data standards lack uniformity and flexibility: different systems and departments use inconsistent standards, so data integration and interoperability are poor. Existing data quality inspection relies mainly on static rules and manual auditing, cannot adapt automatically to a continuously changing data environment, is inefficient, and is prone to missed and false detections. In addition, data acquisition and processing are complex and lack dynamic arrangement capability, which makes diverse data types and integration methods hard to handle. Traditional data governance workflows are generally linear and lack closed-loop management, so real-time data backflow and continuous optimization are not possible. In data quality auditing in particular, the existing technology depends on predefined rules and models and lacks adaptive capability and intelligent analysis, which makes it ill-suited to large-scale, complex, and changeable industrial data environments.

Disclosure of Invention

The present invention has been made in view of the above-described problems.

Therefore, the technical problems solved by the invention are that existing data governance methods have non-uniform standards, inspect data quality inefficiently, cannot adapt dynamically to the data environment, and cannot form a complete closed loop of data governance.

To solve these problems, the invention provides the following technical scheme. An industrial data dynamic arrangement quality detection method comprises: formulating industrial data standards, checking metadata for conformance with the standards, and collecting data from each city into an aggregation library according to data type and integration method; configuring a data quality model and quality inspection rules, automatically generating the mapping between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality; and diverting the audited data, extracting correct data into a standard library, tagging it with data labels to form data assets, and returning data to the data source to form a complete closed loop of data governance.

As a preferable scheme of the industrial data dynamic arrangement quality detection method, formulating the industrial data standards comprises establishing, by business classification, four classes of data standards (technical, business, management, and quality) together with a word root library, and establishing the mapping between data standards and metadata through an intelligent mapping function. When one standard is recommended for different metadata, matching is performed if two or more metadata items with the same Chinese name appear. The matching rule is to match metadata items that have the same Chinese name. For English name selection, the English name of the first matched metadata item is taken as the unified English name. The exclusion rule is that no match is made when the English names are identical but contain no word root. When matching by Chinese word roots and synonyms, the metadata is preliminarily screened using the Chinese word roots and synonyms of the metadata to be matched; when the number of matched metadata items is greater than or equal to 2, the selected metadata is considered matched, and the English name is formed directly from the English forms of the word roots, joined by underscores. When matching by English word roots and abbreviations, the metadata is preliminarily screened using the English word roots and abbreviations; when the number of matched metadata items is greater than or equal to 2, the selected metadata is considered matched, and otherwise it is considered unmatched.
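The same-Chinese-name matching rule, the first-item English name selection, and the exclusion rule can be sketched as follows. This is a minimal illustration, not the patented implementation: the record schema (`cn`, `en`), the function names, and the root library contents are all assumptions, and the exclusion rule is interpreted as "all English names identical and none of their underscore-separated tokens is a known root".

```python
from collections import defaultdict


def unify_english_names(metadata, english_roots):
    """Pick one English name per Chinese name shared by two or more items.

    metadata: list of {"cn": ..., "en": ...} dicts (hypothetical schema).
    english_roots: set of known English word roots (hypothetical root library).
    """
    groups = defaultdict(list)
    for item in metadata:
        groups[item["cn"]].append(item)

    unified = {}
    for cn_name, items in groups.items():
        if len(items) < 2:
            continue  # matching is only triggered by two or more same-named items
        english_names = [item["en"] for item in items]
        all_same = len(set(english_names)) == 1
        has_root = any(
            token in english_roots
            for name in english_names
            for token in name.lower().split("_")
        )
        if all_same and not has_root:
            continue  # exclusion rule: identical English names with no word root
        unified[cn_name] = english_names[0]  # English name of the first item
    return unified


def english_name_from_roots(chinese_roots, root_library):
    """Join the English forms of the matched roots with underscores."""
    return "_".join(root_library[root] for root in chinese_roots)
```

For example, two metadata items both named 设备编号 with English names `device_id` and `dev_no` would be unified under `device_id`, the English name of the first item.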

As a preferable scheme of the industrial data dynamic arrangement quality detection method, formulating the industrial data standards further comprises analyzing the 10 metadata items with the highest overall criticality and matching the metadata item with the highest criticality. The criticality is expressed as:

[Equation image missing from the source text.]

where the terms denote, in order, the criticality evaluation function, the matching-degree function, the matched-root-count function, and the association-degree function. The matching-degree function measures the similarity between two data objects and is expressed as:

[Equation image missing from the source text.]

where the terms include the number of matched Chinese root characters, the number of Chinese characters, and the number of matched English root characters. After the data standards are formulated, metadata is checked for conformance with the standards and evaluated on the matched-root count and the association degree to verify data quality. The matched-root-count function is expressed as:

[Equation image missing from the source text.]

where the terms denote the minimum and maximum numbers of Chinese matches and the minimum and maximum numbers of English matches, respectively. The association-degree function is expressed as:

[Equation image missing from the source text.]

where the terms denote the number of associations and the minimum and maximum numbers of associations, respectively.
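Because the formulas themselves are not reproduced in the text, the following is only a sketch of how such a criticality score could be computed. Every functional form here is an assumption: min-max scaling is inferred from the references to minimum and maximum counts, the matching degree is taken as the mean of the Chinese and English matched-character ratios, and the three components are combined by an equally weighted sum. All function and field names are hypothetical.

```python
def min_max(value, lowest, highest):
    """Scale value into [0, 1]; return 0.0 when the range is empty."""
    if highest == lowest:
        return 0.0
    return (value - lowest) / (highest - lowest)


def matching_degree(cn_matched, cn_total, en_matched, en_total):
    """Assumed form: mean of the Chinese and English matched-character ratios."""
    cn_ratio = cn_matched / cn_total if cn_total else 0.0
    en_ratio = en_matched / en_total if en_total else 0.0
    return (cn_ratio + en_ratio) / 2


def criticality(match, root_score, assoc_score, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed form: weighted sum of the three normalized component scores."""
    w_match, w_root, w_assoc = weights
    return w_match * match + w_root * root_score + w_assoc * assoc_score


def top_candidates(candidates, k=10):
    """Return the k candidates with the highest criticality, best first."""
    return sorted(candidates, key=lambda c: c["criticality"], reverse=True)[:k]
```

With this layout, the metadata item to match would be the first element of `top_candidates(...)`, and the remaining items are the rest of the top 10 that the text says are analyzed.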

As a preferable scheme of the industrial data dynamic arrangement quality detection method, the data quality audit comprises establishing a data quality model and a data quality rule base according to the characteristics of the data, and automatically generating quality inspection rules from the data standards. The quality inspection rules comprise null value checks, value range checks, standard checks, duplicate checks, timeliness checks, reference value checks, and logic checks. The quality inspection scheme comprises a free combination of quality inspection rules, together with the data period, the data labels, and the run strategy for executing the scheme. The data to be inspected is input into pre-trained classifier models, each of which classifies the input data; the accuracy and recall of the different classifiers on the data set are evaluated by cross-validation, and the mean of these accuracy and recall values is taken as the health state evaluation score of the training data.
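A quality inspection scheme built as a free combination of rules can be sketched as a dictionary of named predicates applied to each record. This is an illustrative sketch only: the rule constructors, the record fields, and the duplicate check keyed on a single field are assumptions, and only four of the seven rule types named above are shown.

```python
from datetime import datetime, timedelta


def null_check(field):
    """Rule: the field must be present and not None."""
    return lambda record: record.get(field) is not None


def range_check(field, lowest, highest):
    """Rule: the field must lie within [lowest, highest]."""
    return lambda record: (
        record.get(field) is not None and lowest <= record[field] <= highest
    )


def timeliness_check(field, now, max_age):
    """Rule: the timestamp in the field must be no older than max_age."""
    return lambda record: (
        record.get(field) is not None and now - record[field] <= max_age
    )


def run_scheme(records, rules, key_field):
    """Apply every rule to every record and flag repeated keys as duplicates.

    rules: dict mapping a rule name to a predicate (the free combination).
    Returns (passed, failed); each entry is (record, list_of_broken_rule_names).
    """
    seen_keys = set()
    passed, failed = [], []
    for record in records:
        broken = [name for name, rule in rules.items() if not rule(record)]
        key = record.get(key_field)
        if key in seen_keys:
            broken.append("duplicate")
        seen_keys.add(key)
        (failed if broken else passed).append((record, broken))
    return passed, failed
```

The `failed` list carries the names of the broken rules with each record, which is the information a later diversion step would need to archive abnormal data and raise a correction work order.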

As a preferable scheme of the industrial data dynamic arrangement quality detection method, the data quality audit further comprises constructing a data set containing noisy labels, expressed as:

[Equation image missing from the source text.]

where the terms denote the industrial data features of the i-th sample, the noisy label of the i-th sample, and the total number of samples. The noisy data is predicted to obtain the noise prediction probability of each sample, expressed as:

[Equation image missing from the source text.]

where the terms denote the predicted probability, under the given model, of the noisy label of the i-th sample, and the conditional probability of the noisy label given the sample under the model. Based on the predicted probabilities, the confident joint of the noisy labels and the true labels is computed, expressed as:

[Equation image missing from the source text.]

where the terms denote the confident joint matrix, which represents the joint confidence of the noisy label and the true label; the Dirac delta function, which takes the value 1 when its condition holds and 0 otherwise; and the i-th true label.

The label combinations are counted and normalized to obtain the joint probability distribution, expressed as:

[Equation image missing from the source text.]

where the terms denote the normalized joint probability matrix and the sum of the confidences over all true labels in the confident joint matrix, which is used for normalization.

Using the pruning function of Cleanlab, the data is pruned based on the confident joint matrix and the normalized count data to remove noisy data and obtain clean data. The confidence of each sample is computed and the data is pruned according to that confidence, expressed as:

[Equation image missing from the source text.]

where the terms denote the confidence of the i-th sample and the sum of the predicted probabilities of all samples under the noisy label. A pruning threshold is set, and samples whose confidence is below the threshold are removed to obtain the clean data, expressed as:

[Equation image missing from the source text.]

where the terms denote the clean data set and the pruning threshold.
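The noisy-label steps can be illustrated with a small pure-Python sketch. It does not call Cleanlab and should not be read as the patented formulas, which are not reproduced in the text. The per-class thresholds follow the usual confident learning construction (mean predicted probability of a class over the samples labeled with it), and the per-sample confidence is assumed to be the predicted probability of the sample's noisy label divided by the sum of that label's predicted probabilities over all samples. All function names are hypothetical.

```python
def class_thresholds(probs, noisy_labels, n_classes):
    """Mean predicted probability of class j over the samples labeled j."""
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for p, label in zip(probs, noisy_labels):
        sums[label] += p[label]
        counts[label] += 1
    return [sums[j] / counts[j] if counts[j] else 1.0 for j in range(n_classes)]


def confident_joint(probs, noisy_labels, n_classes):
    """Count samples by (noisy label, confidently predicted label)."""
    thresholds = class_thresholds(probs, noisy_labels, n_classes)
    joint = [[0] * n_classes for _ in range(n_classes)]
    for p, label in zip(probs, noisy_labels):
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if confident:
            likely_true = max(confident, key=lambda j: p[j])
            joint[label][likely_true] += 1
    return joint


def normalize_joint(joint):
    """Divide every count by the total to get a joint probability estimate."""
    total = sum(sum(row) for row in joint)
    if total == 0:
        return [[0.0] * len(row) for row in joint]
    return [[count / total for count in row] for row in joint]


def prune(probs, noisy_labels, threshold):
    """Return indices of samples whose assumed confidence is >= threshold."""
    n_classes = len(probs[0])
    column_sums = [sum(p[j] for p in probs) for j in range(n_classes)]
    kept = []
    for index, (p, label) in enumerate(zip(probs, noisy_labels)):
        denominator = column_sums[label]
        confidence = p[label] / denominator if denominator else 0.0
        if confidence >= threshold:
            kept.append(index)
    return kept
```

Off-diagonal entries of the confident joint count samples whose given label disagrees with the confidently predicted one; these are the candidates for removal, and the threshold in `prune` plays the role of the pruning threshold described above.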

As a preferable scheme of the industrial data dynamic arrangement quality detection method, forming the data assets comprises diverting the audited data: abnormal data is archived to an error data table, which triggers the data quality correction flow and generates a data quality correction work order, while correct data is extracted into the standard library and tagged with data labels to form the data assets.
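The diversion step can be sketched as a single pass over audited records. This is a minimal illustration under assumed data shapes: each audited item is a pair of a record and the list of rules it failed, and the store layout, work order fields, and labeling function are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Stores:
    """In-memory stand-ins for the standard library, error table, and work orders."""
    standard_library: list = field(default_factory=list)
    error_table: list = field(default_factory=list)
    work_orders: list = field(default_factory=list)


def divert(audited, stores, label_fn):
    """Route each audited record to the standard library or the error table.

    audited: iterable of (record, failed_rule_names) pairs.
    label_fn: function returning the data labels for a correct record.
    """
    order_ids = count(1)
    for record, failed_rules in audited:
        if failed_rules:
            # Abnormal data: archive it and raise a correction work order.
            stores.error_table.append({"record": record, "failed": failed_rules})
            stores.work_orders.append(
                {
                    "order_id": next(order_ids),
                    "record_id": record["id"],
                    "failed": failed_rules,
                }
            )
        else:
            # Correct data: extract it and attach labels to form a data asset.
            stores.standard_library.append({**record, "labels": label_fn(record)})
    return stores
```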

As a preferable scheme of the industrial data dynamic arrangement quality detection method, forming the complete closed loop of data governance comprises creating an error data archive and a quality inspection report, analyzing the quality problems in the report, separating the different kinds of data in the problem data set, forming separate problem data work orders by data source, pushing them to the source systems, and tracking the repair of the source data quality problems. Once repaired, the data re-enters the aggregation library from the front-end processor through data aggregation, and data quality governance is repeated.
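The closed loop can be sketched as a bounded cycle of audit, work order creation by data source, repair, and re-aggregation. This is a simplified sketch: `audit` and `repair` stand in for the quality inspection scheme and the source system's fix, the `source` field and the round limit are assumptions, and pushing work orders to real source systems is reduced to a function call.

```python
from collections import defaultdict


def build_work_orders(error_rows):
    """Group problem rows by data source, one work order per source."""
    by_source = defaultdict(list)
    for row in error_rows:
        by_source[row["source"]].append(row)
    return [
        {"source": source, "rows": rows, "status": "open"}
        for source, rows in sorted(by_source.items())
    ]


def closed_loop(records, audit, repair, max_rounds=3):
    """Repeat audit, push-back, repair, and re-aggregation.

    audit: function returning True for a correct record.
    repair: function standing in for the fix made in the source system.
    Returns (standard_library, rows_still_awaiting_audit).
    """
    standard_library, pending = [], list(records)
    for _ in range(max_rounds):
        errors = [row for row in pending if not audit(row)]
        standard_library.extend(row for row in pending if audit(row))
        if not errors:
            return standard_library, []
        orders = build_work_orders(errors)
        # Repaired rows re-enter the aggregation step for the next round.
        pending = [repair(row) for order in orders for row in order["rows"]]
    return standard_library, pending
```

Bounding the number of rounds is a design choice of this sketch; it keeps a source that never fixes its data from stalling the loop, and leaves the unresolved rows visible in the second return value.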

Another object of the invention is to provide an industrial data dynamic arrangement quality detection system which, by configuring a data quality model and quality inspection rules, automatically generating the mapping between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality, solves the problem of low accuracy in existing industrial data quality inspection.

To solve this problem, the invention provides the following technical scheme. The industrial data dynamic arrangement quality detection system comprises a data standardization module, a quality auditing module, and a data distribution module. The data standardization module formulates industrial data standards, checks metadata for conformance with the standards, and collects data from each city into an aggregation library according to data type and integration method. The quality auditing module configures a data quality model and quality inspection rules, automatically generates the mapping between the data standards and the data model, and configures and executes a quality inspection scheme to audit data quality. The data distribution module diverts the audited data, extracts correct data into a standard library, tags it with data labels to form data assets, and returns data to the data source to form a complete closed loop of data governance.

A computer device comprises a memory storing a computer program and a processor which, when executing the computer program, implements the steps of the industrial data dynamic arrangement quality detection method.

A computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the industrial data dynamic arrangement quality detection method.

The beneficial effects of the invention are as follows. By formulating industrial data standards, checking metadata for conformance with them, and collecting data from each city into the aggregation library according to data type and integration method, the method ensures uniform and consistent data standards, integrates and summarizes data effectively into a complete data resource library, improves the efficiency and accuracy of data acquisition, ensures that data undergoes preliminary standardization before entering the aggregation library, and reduces the format and type inconsistencies that can arise during data integration. By configuring and executing the quality inspection scheme to audit data quality, the method inspects and evaluates data quality automatically, keeps quality consistent across processing stages, subjects the data to comprehensive quality inspection, finds and marks quality problems, improves the efficiency and accuracy of quality inspection, and reduces the workload and errors of manual inspection. Because the quality inspection rules and mappings are generated automatically, schemes can be configured and executed quickly, which raises the level of automation in data governance. By diverting the audited data, extracting correct data into the standard library, and tagging it with data labels to form data assets, the method forms a complete closed loop of data governance, enables fine-grained management, efficient use, and fast retrieval and classification of data, improves data utilization and management efficiency, and safeguards data quality and consistency throughout governance. By forming closed-loop management through data backflow, the method establishes a continuously improving data governance mechanism and ensures the long-term stability and optimization of data quality. The invention therefore performs better in accuracy, efficiency, and reliability.

Drawings

To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is an overall flowchart of a method for detecting dynamic arrangement quality of industrial data according to a first embodiment of the present invention.

Fig. 2 is an overall flowchart of an industrial data dynamic arrangement quality detection system according to a third embodiment of the present invention.

Detailed Description

So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

Embodiment 1. Referring to Fig. 1, an embodiment of the invention provides an industrial data dynamic arrangement quality detection method, comprising:

S1: formulating industrial data standards, checking metadata for conformance with the standards, and collecting data from each city into an aggregation library according to data type and integration method.

Further, formulating the industrial data standards comprises establishing, by business classification, four classes of data standards (technical, business, management, and quality) together with a word root library, and establishing the mapping between data standards and metadata through an intelligent mapping function. When one standard is recommended for different metadata, matching is performed if two or more metadata items with the same Chinese name appear. The matching rule is to match metadata items that have the same Chinese name. For English name selection, the English name of the first matched metadata item is taken as the unified English name. The exclusion condition is that no match is made when the English names are identical but contain no word root. When matching by Chinese word roots and synonyms, the metadata is preliminarily screened using the Chinese word roots and synonyms of the metadata to be matched; when the number of matched metadata items is greater than or equal to 2, the selected metadata is considered matched, and the English name is formed directly from the English forms of the word roots, joined by underscores. When matching by English word roots and abbreviations, the metadata is preliminarily screened using the English word roots and abbreviations of the metadata to be matched; when the number of matched metadata items is greater than or equal to 2, the selected metadata is considered matched, and otherwise it is considered unmatched.

It should be noted that formulating the industrial data standards further comprises analyzing the 10 metadata items with the highest overall criticality and matching the metadata item with the highest criticality. The criticality is expressed as:

[Equation image missing from the source text.]

It should also be noted that formulating uniform data standards solves the problem of inconsistent standards and ensures that data is standardized and consistent. In the prior art, data standards are not uniform, data is hard to integrate, and data quality is uneven. Formulating industrial data standards and checking metadata for conformance addresses these problems, improves the efficiency of data acquisition and integration, and provides a solid foundation for subsequent data processing and analysis. Industrial data standards ensure that data of all kinds is uniform and compatible across cities and systems, which improves integration efficiency and reduces the complexity of format conversion. To meet the requirements of building tiered and classified databases, resources are planned in a unified way according to the purpose of the data, and data resources are organized and mined under unified standards and regulated processes. Urban infrastructure data, monitoring data, and supervision data are integrated, and a platform-based modeling tool is used to build the data resources of an urban safety operation center, comprising an aggregation library, a subject library, and a thematic library. These resources provide data support for departmental decision-making and for the command of joint operations, and improve the efficiency of data sharing and use.

S2: configuring a data quality model and quality inspection rules, automatically generating the mapping between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality.

Further, the data quality audit comprises establishing a data quality model and a data quality rule base according to the characteristics of the data, and automatically generating quality inspection rules from the data standards. The quality inspection rules comprise null value checks, value range checks, standard checks, duplicate checks, timeliness checks, reference value checks, and logic checks. The quality inspection scheme comprises a free combination of quality inspection rules, together with the data period, the data labels, and the run strategy for executing the scheme. The data to be inspected is input into pre-trained classifier models, each of which classifies the input data; the accuracy and recall of the different classifiers on the data set are evaluated by cross-validation, and the mean of these accuracy and recall values is taken as the health state evaluation score of the training data.
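The health state evaluation score can be illustrated with a small cross-validation sketch in plain Python. It is a sketch under stated assumptions rather than the patented procedure: the text refers to classifiers trained in advance, whereas here each hypothetical classifier is refit on the training part of every fold; the fold assignment is a simple interleaved split; and the two example classifiers (`fit_threshold`, `fit_always_positive`) are toy stand-ins.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall for the positive class)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_pos = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    positives = sum(1 for t in y_true if t == positive)
    accuracy = correct / len(y_true)
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall


def health_score(samples, labels, fitters, folds=3):
    """Mean of the cross-validated accuracy and recall over all classifiers.

    fitters: functions taking (train_x, train_y) and returning a predict function.
    """
    scores = []
    for fit in fitters:
        accuracies, recalls = [], []
        for start in range(folds):
            held_out = list(range(start, len(samples), folds))
            held_set = set(held_out)
            train_x = [x for i, x in enumerate(samples) if i not in held_set]
            train_y = [y for i, y in enumerate(labels) if i not in held_set]
            predict = fit(train_x, train_y)
            predicted = [predict(samples[i]) for i in held_out]
            acc, rec = accuracy_and_recall([labels[i] for i in held_out], predicted)
            accuracies.append(acc)
            recalls.append(rec)
        scores.append(sum(accuracies) / folds)
        scores.append(sum(recalls) / folds)
    return sum(scores) / len(scores)


def fit_threshold(train_x, train_y):
    """Toy classifier: predict 1 when the value exceeds the training mean."""
    threshold = sum(train_x) / len(train_x)
    return lambda x: int(x > threshold)


def fit_always_positive(train_x, train_y):
    """Toy baseline: always predict the positive class."""
    return lambda x: 1
```

A score close to 1 means that every classifier separates the data consistently across folds; a low score suggests that the data is noisy or inconsistent enough to warrant closer inspection.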

It should be noted that the data quality audit also includes constructing a data set containing noisy labels, expressed as:

[Equation image missing from the source text.]

In the prior art, data quality inspection relies mainly on manual work, which is inefficient and error-prone. Automatically generating the mapping between the data standards and the data model, and configuring and executing the quality inspection scheme, greatly improves the efficiency and accuracy of data quality inspection, reduces manual intervention, and ensures consistent and reliable data quality. Executing automated quality inspection rules provides real-time monitoring of and feedback on data quality, so that data problems are found and resolved promptly.

S3: diverting the audited data, extracting correct data into a standard library, tagging it with data labels to form data assets, and returning data to the data source to form a complete closed loop of data governance.

Further, forming the data assets comprises diverting the audited data: abnormal data is archived to an error data table, which triggers the data quality correction flow and generates a data quality correction work order, while correct data is extracted into the standard library and tagged with data labels to form the data assets.

It should be noted that forming the complete closed loop of data governance comprises creating an error data archive and a quality inspection report, analyzing the quality problems in the report, separating the different kinds of data in the problem data set, forming separate problem data work orders by data source, pushing them to the source systems, and tracking the repair of the source data quality problems. Once repaired, the data re-enters the aggregation library from the front-end processor through data aggregation, and data quality governance is repeated.

It should be further noted that data diversion and data backflow together realize the complete closed loop of data governance and ensure continuous improvement and optimization of data quality. In the prior art, data governance does not achieve continuous improvement of data quality. Diverting the audited data, extracting correct data into the standard library, tagging it with data labels to form data assets, and returning data to the data source address this problem, improve the efficiency and effectiveness of data governance, and form a continuous improvement mechanism for data quality. Turning data into labeled assets further raises the value and reliability of the data and provides high-quality data support for data-driven enterprise decisions.

Embodiment 2. An embodiment of the invention provides an industrial data dynamic arrangement quality detection method. To verify its beneficial effects, the method is demonstrated through economic benefit calculation and simulation experiments.

First, four classes of data standards (technical, business, management, and quality) and a word root library are established by business classification, and the mapping between data standards and metadata is established through the intelligent mapping function. When one standard is recommended for different metadata, matching is performed if two or more metadata items with the same Chinese name appear. The matching rule is to match metadata items that have the same Chinese name; the English name of the first matched metadata item is taken as the unified English name; and no match is made when the English names are identical but contain no word root. When matching by Chinese word roots and synonyms, the metadata is preliminarily screened using the Chinese word roots and synonyms of the metadata to be matched; when the number of matched metadata items is greater than or equal to 2, the selected metadata is considered matched, and the English name is formed directly from the English forms of the word roots, joined by underscores. When matching by English word roots, the selected metadata is likewise considered matched when the number of matched metadata items is greater than or equal to 2.

Next, quality inspection rules are generated automatically, including value range checks, standard checks, duplicate checks, timeliness checks, reference value checks, and logic checks. The quality inspection scheme freely combines these rules and sets the data period, the data labels, and the run strategy for executing the scheme. The data to be inspected is input into pre-trained classifier models, each of which classifies the input data; the accuracy and recall of the different classifiers on the data set are evaluated by cross-validation, and the mean of these accuracy and recall values is taken as the health state evaluation score of the training data.

Finally, the audited data is diverted: abnormal data is archived to the error data table, which triggers the data quality correction flow and generates a data quality correction work order, while correct data is extracted into the standard library and tagged with data labels to form data assets. Data is returned to the data source to form a complete closed loop of data governance: an error data archive and a quality inspection report are created, the quality problems in the report are analyzed, the different kinds of data in the problem data set are separated, separate problem data work orders are formed by data source and pushed to the source systems, and data quality governance is repeated once the problems are repaired.

Table 1. Experimental data records

[Table image missing from the source text.]

Table 1 records the standard conformance rate of the data from each city. The data shows that, after industrial data standards were formulated and metadata was checked for conformance, the standard conformance rate of every city's data exceeded 85%, with the highest reaching 98%. This indicates that the method raises the degree of standardization and ensures consistent, standardized data. The health state evaluation score is computed from the accuracy and recall of the different classifiers; the data shows that the score for every city's data exceeded 0.82, with a maximum of 0.95. This indicates that configuring a data quality model and quality inspection rules, automatically generating the mapping between the data standards and the data model, and configuring and executing a quality inspection scheme effectively improve the accuracy and efficiency of data quality inspection. The data of all test objects was standardized and formed into data assets, which shows that the method diverts the audited data, extracts correct data into the standard library, and tags it with data labels. The closed loop of data governance is thereby achieved, and continuous improvement of data quality is ensured.

Embodiment 3: Referring to FIG. 2, an embodiment of the present invention provides an industrial data dynamic arrangement quality detection system, which includes a data standardization module, a quality auditing module, and a data diversion module.

The data standardization module is used for formulating an industrial data standard, carrying out standard-conformance metadata inspection, and collecting the data of all cities into a collection library according to different data types and integration modes. The quality auditing module is used for configuring a data quality model and quality inspection rules, automatically generating a mapping relation between the data standard and the data model, and configuring and executing a quality inspection scheme to carry out data quality auditing. The data diversion module is used for splitting the audited data, extracting correct data to a standard library, and marking data labels to form data assets; the data flow back to the data source to form a complete closed loop of data governance.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of the computer-readable medium include an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.

It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with appropriate combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), and the like.

It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims

1. A method for detecting the quality of dynamic arrangement of industrial data, characterized by comprising:

formulating industrial data standards, conducting metadata checks, and collecting data from various cities into a collection library according to different data types and integration methods;

configuring data quality models and quality inspection rules, automatically generating mapping relationships between data standards and data models, configuring and executing quality inspection plans, and conducting data quality audits;

diverting the audit data, extracting correct data to the standard library, and marking data tags to form data assets, the data flowing back to the data source to form a complete closed loop of data governance;

The formulation of industrial data standards includes establishing, according to business classification, four types of data standards and root word libraries, namely technology, business, management and quality, and establishing a mapping relationship between data standards and metadata through an intelligent mapping function;

When standards are recommended for different metadata, matching is performed if two or more metadata items share the same Chinese name. Under the matching rule, metadata items with the same Chinese name are matched, and the English name of the first matched metadata item is selected as the unified English name. Under the exclusion condition, no match is made where the English names are the same but contain no root word;

When matching Chinese root words and synonyms, the metadata is preliminarily screened by matching the Chinese root words and synonyms of the metadata. When the number of matched metadata is greater than or equal to 2, the selected metadata is considered to be matched. For the matched metadata, the English name directly uses the English combination of the root words, with the root words separated by underscores;

When matching English root words and English abbreviations, the metadata is preliminarily screened by matching the English root words and abbreviations of the metadata. When the number of matched metadata is greater than or equal to 2, the selected metadata is considered to be matched. For the matched metadata, the Chinese name directly uses the Chinese combination of the root words, with no separator between the root words;
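As a non-limiting illustration, the Chinese root-word matching rule can be sketched in Python as follows. The root word library `ROOTS`, the helper names `match_roots` and `recommend_english_name`, and the greedy left-to-right segmentation are hypothetical choices made for this sketch. The sketch also assumes that the "greater than or equal to 2" threshold is applied to the number of matched root words; only the threshold value and the underscore-joined English name are taken from the text.

```python
# Hypothetical root word library: Chinese root -> English root (illustrative only).
ROOTS = {"设备": "equipment", "温度": "temperature", "编号": "code"}

def match_roots(chinese_name, roots=ROOTS):
    """Greedily segment a Chinese metadata name into known root words."""
    matched, i = [], 0
    # Try longer roots first so a long root is not shadowed by a shorter one.
    candidates = sorted(roots, key=len, reverse=True)
    while i < len(chinese_name):
        for root in candidates:
            if chinese_name.startswith(root, i):
                matched.append(root)
                i += len(root)
                break
        else:
            i += 1  # no root starts at this position; skip one character
    return matched

def recommend_english_name(chinese_name, roots=ROOTS, min_roots=2):
    """Return the underscore-joined English name, or None if too few roots match."""
    matched = match_roots(chinese_name, roots)
    if len(matched) < min_roots:
        return None  # assumed reading of the ">= 2" threshold
    return "_".join(roots[r] for r in matched)
```

Under these assumptions, a name made of the roots for "equipment" and "temperature" yields `equipment_temperature`, while a name containing a single root yields no recommendation.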

The formulation of industrial data standards also includes analyzing the top 10 metadata with the highest comprehensive criticality and matching the metadata with the highest criticality, the criticality being calculated as:

$$\mathrm{Key}(x) = 0.4\cdot\mathrm{Sim}(x,y) + 0.25\cdot\mathrm{RootCount}(x) + 0.35\cdot\mathrm{Rel}(x,y)$$

where Key(x) represents the criticality evaluation function, Sim(x,y) represents the matching degree calculation function, RootCount(x) represents the matching root number calculation function, and Rel(x,y) represents the relevance calculation function;
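As a non-limiting illustration, the weighted criticality score and the top-10 selection can be sketched in Python. The function names `key_score` and `top_critical` and the input layout (a mapping from metadata name to its three component scores) are hypothetical; the sketch also assumes that the three component scores have already been computed and scaled to a comparable range.

```python
def key_score(sim, root_count, rel):
    """Key(x) = 0.4*Sim(x,y) + 0.25*RootCount(x) + 0.35*Rel(x,y)."""
    return 0.4 * sim + 0.25 * root_count + 0.35 * rel

def top_critical(candidates, k=10):
    """Rank candidates (name -> (sim, root_count, rel)) by Key, highest first."""
    scored = {name: key_score(*parts) for name, parts in candidates.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:k]
```

Because the weights sum to 1, component scores in [0, 1] give a criticality in [0, 1], which makes scores of different metadata directly comparable.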

The matching degree calculation function is used to measure the similarity between two data objects, expressed as:

where $C_{\mathrm{match}}$ is the number of matched Chinese root characters, $C_{\mathrm{total}}$ is the total number of Chinese characters, $E_{\mathrm{match}}$ is the number of matched English root characters, and $E_{\mathrm{total}}$ is the total number of English characters;

After the data standards are established, the metadata is checked and evaluated based on the number of matching root words and the degree of relevance to verify the quality of the data. The function for calculating the number of matching root words is expressed as:

where $C_{\min}$ and $C_{\max}$ are the minimum and maximum numbers of Chinese matches, and $E_{\min}$ and $E_{\max}$ are the minimum and maximum numbers of English matches, respectively. The relevance calculation function is expressed as:

where $R$ is the relevance number, and $R_{\min}$ and $R_{\max}$ are the minimum and maximum relevance numbers, respectively;

Data quality auditing includes establishing a data quality model and a data quality rule base based on data characteristics, and automatically generating quality inspection rules from the data standards. The quality inspection rules include null value inspection, value range inspection, specification inspection, repeatability inspection, timeliness inspection, reference value inspection, and logic inspection. The quality inspection plan includes freely combining the quality inspection rules and setting the data period, data label, and operation strategy for executing the plan;
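As a non-limiting illustration, the free combination of quality inspection rules into a plan can be sketched in Python. Only three of the listed rule types are shown (null value, value range, repeatability); the helper names `null_check`, `range_check` and `run_plan`, the dictionary-based row layout, and the field names used below are hypothetical.

```python
def null_check(field):
    """Rule: the field must be present and non-empty."""
    return lambda row: row.get(field) not in (None, "")

def range_check(field, low, high):
    """Rule: the field must be numeric and lie within [low, high]."""
    def rule(row):
        value = row.get(field)
        return isinstance(value, (int, float)) and low <= value <= high
    return rule

def run_plan(rows, rules, key_field):
    """Apply every rule to every row; also flag repeated key values."""
    passed, failed, seen = [], [], set()
    for row in rows:
        errors = [name for name, rule in rules.items() if not rule(row)]
        if row.get(key_field) in seen:
            errors.append("duplicate")  # repeatability inspection
        seen.add(row.get(key_field))
        (failed if errors else passed).append((row, errors))
    return passed, failed
```

A plan is then just a named set of rules, for example `{"null": null_check("temp"), "range": range_check("temp", -50, 150)}`, which can be recombined without changing the runner.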

The data to be quality inspected is input into the pre-trained classifier model. Each classifier model classifies the input data. The accuracy and recall rates of different classifiers on the data set are evaluated using cross-validation. The average of the accuracy and recall rates of different classifiers is used as the health status assessment score of the training data.
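As a non-limiting illustration, the health status assessment score can be sketched in Python as the mean of the cross-validated accuracy and recall of several classifiers. The k-fold split, the binary labels with positive class 1, and the two toy classifiers `fit_threshold` and `fit_majority` are assumptions made for this sketch; the text does not specify the classifier types or the fold scheme.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    hits = [p == positive for t, p in zip(y_true, y_pred) if t == positive]
    return sum(hits) / len(hits) if hits else 0.0

def cross_validate(fit, X, y, k=3):
    """Return the mean accuracy and mean recall of one classifier over k folds."""
    accs, recs = [], []
    for i in range(k):
        test = list(range(i, len(X), k))  # every k-th sample forms the test fold
        train = [j for j in range(len(X)) if j % k != i]
        predict = fit([X[j] for j in train], [y[j] for j in train])
        y_true = [y[j] for j in test]
        y_pred = [predict(X[j]) for j in test]
        accs.append(accuracy(y_true, y_pred))
        recs.append(recall(y_true, y_pred))
    return sum(accs) / k, sum(recs) / k

def health_score(classifiers, X, y, k=3):
    """Average accuracy and recall over all classifiers into one score."""
    metrics = [m for fit in classifiers for m in cross_validate(fit, X, y, k)]
    return sum(metrics) / len(metrics)

# Toy classifiers (hypothetical): each returns a prediction function.
def fit_threshold(X_train, y_train):
    cut = sum(X_train) / len(X_train)
    return lambda x: 1 if x > cut else 0

def fit_majority(X_train, y_train):
    label = 1 if 2 * sum(y_train) >= len(y_train) else 0
    return lambda x: label
```

A score close to 1 indicates that the classifiers agree with the labels on held-out folds; a low score suggests noisy or inconsistent data.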

2. The method for detecting the quality of dynamic arrangement of industrial data according to claim 1, wherein the data quality audit further comprises constructing a data set X containing noise labels, which is expressed as:

$$X = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$$

where $x_i$ represents the industrial data features of the i-th sample, $\tilde{y}_i$ represents the noise label of the i-th sample, and n represents the total number of samples;

predicting on the noisy data set X to obtain the noise prediction probability of each sample, expressed as:

where $P(\tilde{y}_i \mid x_i; \theta)$ represents the predicted probability of the noise label $\tilde{y}_i$ of the i-th sample $x_i$ under a given model θ; it is a conditional probability, namely the probability of the noise label $\tilde{y}_i$ given the sample $x_i$ under the model θ;

calculating, based on the predicted probability, the joint probability distribution of the noise label $\tilde{y}$ and the true label $y^*$, expressed as:

where the confident joint matrix holds the joint confidence of the noise label $\tilde{y}$ and the true label $y^*$; δ represents the Dirac delta function, which takes the value 1 when its argument is 0 and the value 0 otherwise; and $y_j$ represents the j-th true label;

counting and normalizing the label combinations to obtain the joint probability distribution, expressed as:

where the normalized joint probability matrix is obtained by dividing by the sum of the confidences of all true labels in the confident joint matrix, which is used for normalization;

using the pruning function of Cleanlab to prune the data based on the confident joint matrix and the normalized count data, removing noisy data to obtain clean data, wherein the confidence of each sample is calculated and pruning is performed based on the confidence, expressed as:

where the first term represents the confidence of the i-th sample and the summation term represents the sum of the predicted probabilities of all samples under the noise label $\tilde{y}$; a pruning threshold is set, and samples whose confidence is below the threshold are removed to obtain clean data, expressed as:

where $X_{\mathrm{clean}}$ represents the clean data set and τ represents the set pruning threshold.
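As a non-limiting illustration, the counting, normalization and threshold-based pruning steps can be sketched in plain Python. This sketch does not call Cleanlab. It assumes that the model's most probable class stands in for the true label when counting label combinations, and that a sample's confidence is the predicted probability of its given noise label; the function names `joint_distribution` and `prune` are hypothetical.

```python
def joint_distribution(noisy_labels, probs, n_classes):
    """Count (noise label, predicted label) pairs and normalize to sum to 1."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for label, p in zip(noisy_labels, probs):
        # Assumption: the most probable class stands in for the true label.
        guess = max(range(n_classes), key=lambda c: p[c])
        counts[label][guess] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]

def prune(samples, noisy_labels, probs, tau):
    """Keep samples whose confidence in their given label is at least tau."""
    clean = []
    for x, label, p in zip(samples, noisy_labels, probs):
        if p[label] >= tau:  # assumed confidence: probability of the given label
            clean.append((x, label))
    return clean
```

Off-diagonal mass in the normalized matrix estimates how often a given label disagrees with the model, and raising τ trades a smaller clean set for fewer retained label errors.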

3. The method for detecting the quality of dynamic arrangement of industrial data according to claim 2, characterized in that the formation of data assets includes diverting the audit data, archiving abnormal data to an error data table, triggering a data quality rectification process, generating a data quality rectification work order, extracting correct data to the standard library, and marking data tags to form data assets.

4. The method for detecting the quality of dynamic arrangement of industrial data according to claim 3, characterized in that the formation of a complete closed loop of data governance includes creating an error data archive and a quality inspection report, analyzing the quality problems in the report, separating the data in the problem data set, forming different problem data work orders according to the data source, pushing the work orders to the data source system, and tracking the repair of source data quality problems; once the repair is completed, the data enter the collection library again from the front-end machine through data aggregation, and data quality management is repeated.
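As a non-limiting illustration, the diversion of audit data and the grouping of problem data into per-source work orders can be sketched in Python. The row layout with `id`, `source` and `errors` fields and the function names `divert` and `build_work_orders` are hypothetical; pushing the orders to the source system and tracking repairs are outside the scope of this sketch.

```python
from collections import defaultdict

def divert(audited_rows):
    """Split audited rows into standard-library rows and error-table rows."""
    standard, errors = [], []
    for row in audited_rows:
        (errors if row["errors"] else standard).append(row)
    return standard, errors

def build_work_orders(error_rows):
    """Group error rows by source system so that each source gets one work order."""
    orders = defaultdict(list)
    for row in error_rows:
        orders[row["source"]].append({"id": row["id"], "errors": row["errors"]})
    return dict(orders)
```

Grouping by source keeps each work order addressed to the system that can actually repair the data, after which the repaired records re-enter the collection library and are audited again.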

5. A system using the industrial data dynamic arrangement quality detection method as claimed in any one of claims 1 to 4, characterized in that it comprises a data standardization module, a quality audit module, and a data diversion module;

The data standardization module is used to formulate industrial data standards, perform metadata checks, and collect data from various cities into a collection library according to different data types and integration methods;

The quality audit module is used to configure data quality models and quality inspection rules, automatically generate the mapping relationship between data standards and data models, configure and execute quality inspection plans, and conduct data quality audits;

The data diversion module is used to divert audit data, extract correct data to the standard library, and mark data tags to form data assets. The data flows back to the data source to form a complete closed loop of data governance.

6. A computer device comprising a memory and a processor, wherein the memory stores a computer program, wherein the processor implements the steps of the industrial data dynamic orchestration quality detection method according to any one of claims 1 to 4 when executing the computer program.

7. A computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the steps of the industrial data dynamic arrangement quality detection method according to any one of claims 1 to 4 are implemented.