
CN118838895B - Industrial data dynamic arrangement quality detection method and system - Google Patents

Industrial data dynamic arrangement quality detection method and system

Info

Publication number
CN118838895B
Authority
CN
China
Prior art keywords
data
quality
metadata
matching
inspection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411313595.6A
Other languages
Chinese (zh)
Other versions
CN118838895A (en)
Inventor
袁存发
汤幸福
毛旭初
陆文迪
胡迪
郑豹
汤世康
李重阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Luculent Smart Technologies Co ltd
Original Assignee
Luculent Smart Technologies Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Luculent Smart Technologies Co ltd
Priority to CN202411313595.6A
Publication of CN118838895A
Application granted
Publication of CN118838895B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method and a system for detecting the dynamic arrangement quality of industrial data, relating to the technical field of data governance and quality detection. The method comprises: formulating industrial data standards, performing standard-conformance metadata inspection, and collecting data from each city into a collection library according to data type and integration mode; configuring a data quality model and quality inspection rules, automatically generating the mapping between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality; and splitting the audited data, extracting correct data into a standard library and tagging it with data labels to form data assets, while flowing data back to its source to form a complete closed loop of data governance. By auditing data quality, the method improves the efficiency and accuracy of data quality inspection and reduces the workload and errors of manual inspection. By forming closed-loop management through data backflow, it establishes a mechanism of continuous improvement in data governance and keeps data quality stable and optimized over the long term.

Description

Industrial data dynamic arrangement quality detection method and system
Technical Field
The invention relates to the technical field of data governance and quality detection, and in particular to a method and a system for detecting the dynamic arrangement quality of industrial data.
Background
With the rapid development of the industrial internet, technologies for collecting and processing industrial data have advanced significantly. Industrial data spans multiple departments, systems, and industries, including manufacturing, energy, and transportation. The data are generated by sensors, monitoring systems, and other data collection devices, and cover equipment status, production processes, environmental monitoring, and more. To manage and use this volume of data effectively, data standardization and governance technologies have been developed. Existing data governance technologies rely mainly on architectures such as the data warehouse, the data lake, and the data middle platform; by establishing unified data standards and specifications, they clean, integrate, and analyze data to support enterprise operating decisions and intelligent manufacturing. However, as data scale keeps growing and data structures become more complex, traditional data governance technologies face major challenges, and more efficient and intelligent solutions are needed.
Although existing data governance technologies have been effective to some degree in processing industrial data, many shortcomings remain. Data standards lack uniformity and flexibility: standards differ across systems and departments, so data integration and interoperability are poor. Existing data quality inspection relies mainly on static rules and manual auditing; it cannot adapt automatically to a changing data environment, auditing is inefficient, and missed and false detections occur easily. In addition, data acquisition and processing are complex and lack dynamic arrangement capability, making it difficult to handle diverse data types and integration modes. Traditional data governance workflows are generally linear and lack closed-loop management, so real-time data backflow and continuous optimization cannot be achieved. In data quality auditing in particular, the prior art relies mainly on predefined rules and models and lacks adaptive capability and intelligent analysis, making it difficult to cope with large-scale, complex, and changeable industrial data environments.
Disclosure of Invention
The present invention has been made in view of the above-described problems.
Therefore, the technical problems solved by the invention are that existing data governance methods have non-uniform standards and inefficient data quality inspection, cannot adapt dynamically to the data environment, and do not achieve a complete closed loop of data governance.
To solve these technical problems, the invention provides the following technical scheme: an industrial data dynamic arrangement quality detection method, comprising: formulating industrial data standards, performing standard-conformance metadata inspection, and collecting data from each city into a collection library according to data type and integration mode; configuring a data quality model and quality inspection rules, automatically generating the mapping between the data standards and the data model, and configuring and executing a quality inspection scheme to audit data quality; and splitting the audited data, extracting correct data into a standard library and tagging it with data labels to form data assets, and flowing data back to its source to form a complete closed loop of data governance.
As a preferred scheme of the industrial data dynamic arrangement quality detection method of the invention, formulating the industrial data standards comprises: establishing, by business classification, four classes of data standards (technical, business, management, and quality) together with a root-word library, and establishing the mapping between data standards and metadata through an intelligent mapping function. When one standard is recommended for different metadata, matching is performed if two or more metadata items have the same Chinese name. The matching rule is to match metadata items having the same Chinese name and to take the English name of the first matched metadata item as the unified English name. The exclusion rule is that no match is made when the English names are identical but contain no root word. When matching by Chinese root words and synonyms, the metadata are first screened by the Chinese root words and synonyms; if the number of matched metadata items is greater than or equal to 2, the screened metadata are regarded as matched, and the English combination of the root words is adopted directly as the unified English name. When matching by English root words and abbreviations, the metadata are first screened by the English root words and abbreviations, and the same threshold of 2 matched metadata items decides whether the screened metadata are regarded as matched; otherwise they are regarded as unmatched.
As a preferred scheme of the industrial data dynamic arrangement quality detection method of the invention, formulating the industrial data standards further comprises analyzing the 10 metadata items with the highest overall criticality and matching the metadata item with the highest criticality. The criticality is computed by a criticality evaluation function from three component functions: a matching-degree function, a matched-root-count function, and a relevance function.
The matching-degree function measures the degree of similarity between two data objects. Its inputs include the number of matched Chinese root characters, the number of Chinese characters, and the number of matched English root characters.
After the data standards are formulated, standard-conformance metadata inspection is performed: the metadata are evaluated on the basis of the matched-root count and the relevance in order to verify data quality. The matched-root-count function is computed from the minimum and maximum numbers of Chinese matches and the minimum and maximum numbers of English matches.
The relevance function is computed from the number of associations together with the minimum and maximum numbers of associations.
As a preferred scheme of the industrial data dynamic arrangement quality detection method of the invention, auditing data quality comprises: establishing a data quality model and a data quality rule base according to the data characteristics, and automatically generating quality inspection rules from the data standards. The quality inspection rules comprise null-value checks, value-range checks, specification checks, duplication checks, timeliness checks, reference-value checks, and logic checks. The quality inspection scheme comprises freely combining quality inspection rules and setting the data period, the data labels, and the run strategy under which the scheme is executed. The data to be inspected are input into pre-trained classifier models, and each classifier model classifies the input data. The accuracy and recall of the different classifiers on the data set are evaluated by cross-validation, and the average of the accuracy and recall of the different classifiers is taken as the health-state evaluation score of the training data.
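To make the idea of freely combining quality inspection rules concrete, the following is a minimal sketch in Python, not the patented implementation. It covers only three of the listed rule types (null-value, value-range, reference-value) plus a duplication check. The names `Rule`, `not_null`, `in_range`, `in_reference`, and `run_scheme`, and the record fields used in the usage note, are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable

Record = dict[str, Any]


@dataclass
class Rule:
    """A named quality inspection rule; check() returns True when a record passes."""
    name: str
    check: Callable[[Record], bool]


def not_null(field: str) -> Rule:
    # Null-value check: the field must be present and non-empty.
    return Rule(f"null:{field}", lambda r: r.get(field) not in (None, ""))


def in_range(field: str, lo: float, hi: float) -> Rule:
    # Value-range check: the field must be numeric and lie within [lo, hi].
    def check(r: Record) -> bool:
        v = r.get(field)
        return isinstance(v, (int, float)) and lo <= v <= hi
    return Rule(f"range:{field}", check)


def in_reference(field: str, allowed: set) -> Rule:
    # Reference-value check: the field must be one of the allowed codes.
    return Rule(f"ref:{field}", lambda r: r.get(field) in allowed)


def run_scheme(records: list[Record], rules: list[Rule], key_field: str):
    """Apply every rule to every record and flag repeated keys as duplicates.

    Returns a list of (record, failed_rule_names) pairs.
    """
    seen = set()
    results = []
    for rec in records:
        failed = [rule.name for rule in rules if not rule.check(rec)]
        key = rec.get(key_field)
        if key in seen:
            failed.append(f"duplicate:{key_field}")
        seen.add(key)
        results.append((rec, failed))
    return results
```

A scheme is then just a list of rules, for example `[not_null("temp"), in_range("temp", -50, 150), in_reference("unit", {"C", "F"})]`, passed to `run_scheme` together with the field that identifies a record.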
As a preferred scheme of the industrial data dynamic arrangement quality detection method of the invention, auditing data quality further comprises constructing a data set $D$ containing noisy labels, expressed as:
$$D = \{(x_i, \tilde{y}_i)\}_{i=1}^{N},$$
where $x_i$ denotes the industrial data features of the $i$-th sample, $\tilde{y}_i$ denotes the noisy label of the $i$-th sample, and $N$ denotes the total number of samples. Prediction is performed on the noisy data $D$ to obtain the noise prediction probability of each sample, expressed as:
$$\hat{p}(\tilde{y}_i \mid x_i; \theta) = P(\tilde{y}_i \mid x_i; \theta),$$
where $\hat{p}(\tilde{y}_i \mid x_i; \theta)$ denotes the predicted probability of the noisy label $\tilde{y}_i$ of the $i$-th sample $x_i$ under the given model $\theta$, and $P(\tilde{y}_i \mid x_i; \theta)$ is the conditional probability of the noisy label $\tilde{y}_i$ given the sample $x_i$ under the model $\theta$. Based on the predicted probabilities, the joint confidence of the noisy label $\tilde{y}$ and the true label $y^*$ is calculated as a confident joint matrix $C_{\tilde{y}, y^*}$. Its definition uses the Dirac delta function $\delta$, which takes the value 1 when its condition holds and 0 otherwise, and $y^*_k$, which denotes the $k$-th true label.
The label combinations are counted and normalized to obtain a joint probability distribution: the normalized joint probability matrix $Q_{\tilde{y}, y^*}$ is obtained by dividing the entries of $C_{\tilde{y}, y^*}$ by the sum of the confidences of all true labels in the confident joint matrix, which serves as the normalization term.
Using the Cleanlab pruning function, data pruning is performed with the confident joint matrix $C_{\tilde{y}, y^*}$ and the normalized count data $Q_{\tilde{y}, y^*}$ to remove noisy data and obtain clean data. The confidence of each sample is calculated and the data are pruned according to that confidence, expressed as:
$$c_i = \frac{\hat{p}(\tilde{y}_i \mid x_i; \theta)}{\sum_{j=1}^{N} \hat{p}(\tilde{y}_i \mid x_j; \theta)},$$
where $c_i$ denotes the confidence of the $i$-th sample and the denominator is the sum of the predicted probabilities of all samples under the noisy label $\tilde{y}_i$. A pruning threshold is set, and samples whose confidence is below the threshold are removed to obtain the clean data, expressed as:
$$D_{\text{clean}} = \{(x_i, \tilde{y}_i) \in D \mid c_i \ge \tau\},$$
where $D_{\text{clean}}$ denotes the clean data set and $\tau$ denotes the set pruning threshold.
The method for detecting the dynamic arrangement quality of the industrial data comprises the steps of dividing audit data, archiving abnormal data into an error data table, triggering a data quality correction flow to generate a data quality correction work order, extracting correct data into a standard library, and marking a data label to form the data asset.
The invention is used as a preferable scheme of the industrial data dynamic arrangement quality detection method, wherein the forming of the complete closed loop of data management comprises the steps of creating an error data archive and a quality detection report, analyzing quality problems in the report, separating various data from a problem data set, forming different problem data worksheets according to data sources, pushing the different problem data worksheets to a data source system, tracking the restoration of source data quality problems, entering the aggregation library from a front-end processor again through data aggregation after the restoration is completed, and carrying out repeated data quality management.
Another object of the present invention is to provide an industrial data dynamic arrangement quality detection system, which can automatically generate a mapping relation between a data standard and a data model by configuring a data quality model and quality inspection rules, configure and execute a quality inspection scheme to perform data quality inspection, so as to solve the problem of low accuracy in quality inspection of the existing industrial data.
The industrial data dynamic arrangement quality detection system comprises a data standardization module, a quality auditing module and a data distribution module, wherein the data standardization module is used for making industrial data standards, checking standard falling metadata, collecting all city data to a collection base according to different data types and integration modes, the quality auditing module is used for configuring a data quality model and quality checking rules, automatically generating a mapping relation between the data standards and the data model, configuring and executing a quality checking scheme, and conducting data quality auditing, and the data distribution module is used for distributing auditing data, extracting correct data to a standard base, marking data labels, forming data assets, and enabling the data to flow back to a data source to form a complete data closed loop.
A computer device comprising a memory storing a computer program and a processor executing the computer program is a step of implementing a method for dynamic orchestration quality detection of industrial data.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of a method for dynamic orchestration quality detection of industrial data.
The industrial data dynamic arrangement quality detection method has the beneficial effects that the industrial data dynamic arrangement quality detection method provided by the invention is used for acquiring all city data to the collection library according to different data types and integration modes by formulating industrial data standards and checking falling standard metadata, so that the uniformity and consistency of the data standards are ensured, the data integration and summarization are effectively carried out, a complete data resource library is formed, the standardization and consistency of the data are ensured, the data acquisition efficiency and accuracy are improved, the data is ensured to be subjected to preliminary standardized processing before entering the collection library, and the problem of inconsistent formats and types possibly occurring in the data integration process is reduced; the quality inspection scheme is configured and executed to perform data quality inspection, so that the automatic inspection and evaluation of the data quality are realized, the quality consistency of the data in different processing links is ensured, the data is subjected to comprehensive quality inspection, the quality problem in the data is found and marked, the efficiency and accuracy of the data quality inspection are improved, the workload and error of manual inspection are reduced, the quality inspection scheme is rapidly configured and executed by automatically generating the quality inspection rule and the mapping relation, the automation level of the data management is improved, the correct data is extracted to a standard library by shunting the inspection data, the data label is marked, the data asset is formed, the complete closed loop of the data management is formed, the refined management and the efficient utilization of the data are realized, the quick retrieval and classification of the data are realized, the data utilization rate and the 
management efficiency are improved, the data quality and consistency in the data management process are ensured, the invention has better effects in the aspects of accuracy, efficiency and reliability by forming closed-loop management through data backflow, establishing a continuously improved data management mechanism and ensuring long-term stability and optimization of data quality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an overall flowchart of a method for detecting dynamic arrangement quality of industrial data according to a first embodiment of the present invention.
Fig. 2 is an overall flowchart of an industrial data dynamic arrangement quality detection system according to a third embodiment of the present invention.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Embodiment 1: referring to Fig. 1, an embodiment of the invention provides an industrial data dynamic arrangement quality detection method, comprising:
S1: Formulate industrial data standards, perform standard-conformance metadata inspection, and collect data from each city into the collection library according to data type and integration mode.
Further, formulating the industrial data standards comprises: establishing, by business classification, four classes of data standards (technical, business, management, and quality) together with a root-word library, and establishing the mapping between data standards and metadata through an intelligent mapping function. When one standard is recommended for different metadata, matching is performed if two or more metadata items have the same Chinese name. The matching rule is to match metadata items having the same Chinese name and to take the English name of the first matched metadata item as the unified English name. The exclusion condition is that the English names are identical but contain no root word, in which case no match is made. When matching by Chinese root words and synonyms, the metadata are first screened by the Chinese root words and synonyms; if the number of matched metadata items is greater than or equal to 2, the screened metadata are regarded as matched, and the English combination of the root words is adopted directly as the unified English name. When matching by English root words and abbreviations, the metadata are first screened by the English root words and abbreviations, and the same threshold of 2 matched metadata items decides whether the screened metadata are regarded as matched; otherwise they are regarded as unmatched.
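The same-Chinese-name matching rule and its exclusion condition can be sketched as follows. This is an illustration under assumptions, not the patent's intelligent mapping function: metadata items are taken to be `(chinese_name, english_name)` pairs, English names are assumed to be underscore-separated so that root words can be looked up token by token, and the function name `unify_english_names` is hypothetical.

```python
from collections import defaultdict


def unify_english_names(metadata, roots):
    """Pick a unified English name for metadata items that share a Chinese name.

    metadata: list of (chinese_name, english_name) pairs, in source order.
    roots:    set of known English root words (assumed lowercase).
    Returns {chinese_name: unified_english_name} for matched groups only.
    """
    groups = defaultdict(list)
    for chinese_name, english_name in metadata:
        groups[chinese_name].append(english_name)

    unified = {}
    for chinese_name, names in groups.items():
        if len(names) < 2:
            continue  # matching needs two or more items with the same Chinese name
        same_english = len(set(names)) == 1
        # Assumption: root words are found by splitting the name on underscores.
        has_root = any(tok in roots for tok in names[0].lower().split("_"))
        if same_english and not has_root:
            continue  # exclusion: identical English names containing no root word
        unified[chinese_name] = names[0]  # English name of the first matched item
    return unified
```

For instance, two items named `设备编号` with English names `equip_no` and `device_id` would be unified to `equip_no`, while two items named `温度` that both use the rootless name `t1` would be left unmatched.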
It should be noted that formulating the industrial data standards further comprises analyzing the 10 metadata items with the highest overall criticality and matching the metadata item with the highest criticality. The criticality is computed by a criticality evaluation function from three component functions: a matching-degree function, a matched-root-count function, and a relevance function.
The matching-degree function measures the degree of similarity between two data objects. Its inputs include the number of matched Chinese root characters, the number of Chinese characters, and the number of matched English root characters.
After the data standards are formulated, standard-conformance metadata inspection is performed: the metadata are evaluated on the basis of the matched-root count and the relevance in order to verify data quality. The matched-root-count function is computed from the minimum and maximum numbers of Chinese matches and the minimum and maximum numbers of English matches.
The relevance function is computed from the number of associations together with the minimum and maximum numbers of associations.
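The criticality ranking described above can be illustrated with a small sketch. The exact formulas are not available in this text, so every formula below is an assumption made for illustration only: min-max normalization for the count-based terms, a mean of Chinese and English character ratios for the matching degree (including an assumed fourth input, the number of English characters), and an equal-weight sum for the criticality. All function names are hypothetical.

```python
def min_max(value, lo, hi):
    """Min-max normalization to [0, 1] (assumed form; returns 0.0 when hi == lo)."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def matching_degree(cn_matched_chars, cn_chars, en_matched_chars, en_chars):
    """Assumed form: mean of the Chinese and English matched-character ratios."""
    cn = cn_matched_chars / cn_chars if cn_chars else 0.0
    en = en_matched_chars / en_chars if en_chars else 0.0
    return (cn + en) / 2


def criticality(m, r, a, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed form: weighted sum of matching degree m, root-count score r, relevance a."""
    wm, wr, wa = weights
    return wm * m + wr * r + wa * a


def top_k(scores, k=10):
    """Return the k metadata names with the highest criticality, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With scores held in a dictionary keyed by metadata name, `top_k(scores)` yields the 10 most critical items, and the first element is the one matched first.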
It should also be noted that formulating unified data standards solves the problem of inconsistent data standards and ensures the standardization and consistency of data. In the prior art, data standards are not unified, data integration is difficult, and data quality is uneven. Formulating industrial data standards and performing standard-conformance metadata inspection addresses these problems: it improves the efficiency of data acquisition and integration, ensures the standardization and consistency of data, and provides a solid foundation for subsequent data processing and analysis. Industrial data standards ensure that data of all kinds are uniform and compatible across cities and systems, which improves integration efficiency and reduces the complexity of data format conversion. Databases are built by class and grade according to the purpose for which the data are used, resources are planned in a unified way, and data resources are organized and mined under unified standards and standardized procedures. Urban infrastructure data, monitoring data, and supervision data are integrated, and a platform-based modeling tool is used to build the data resources of an urban safety operation center, comprising a collection library, a subject library, and a thematic library, which provide data support for decision-making departments and improve the efficiency of data sharing and use.
S2: Configure a data quality model and quality inspection rules, automatically generate the mapping between the data standards and the data model, and configure and execute a quality inspection scheme to audit data quality.
Furthermore, auditing data quality comprises: establishing a data quality model and a data quality rule base according to the data characteristics, and automatically generating quality inspection rules from the data standards. The quality inspection rules comprise null-value checks, value-range checks, specification checks, duplication checks, timeliness checks, reference-value checks, and logic checks. The quality inspection scheme comprises freely combining quality inspection rules and setting the data period, the data labels, and the run strategy under which the scheme is executed. The data to be inspected are input into pre-trained classifier models, and each classifier model classifies the input data. The accuracy and recall of the different classifiers on the data set are evaluated by cross-validation, and the average of the accuracy and recall of the different classifiers is taken as the health-state evaluation score of the training data.
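The health-state score can be sketched as a plain average over classifiers, folds, and the two metrics. This is one reading of the sentence above, not a confirmed formula: the sketch assumes binary labels with `1` as the positive class, takes cross-validation predictions as given rather than training any model, and uses the hypothetical names `accuracy`, `recall`, and `health_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


def health_score(fold_results):
    """Average accuracy and recall over all classifiers and folds.

    fold_results: {classifier_name: [(y_true, y_pred), ...]}, one pair per
    cross-validation fold (the predictions are assumed to be computed elsewhere).
    """
    metrics = []
    for folds in fold_results.values():
        for y_true, y_pred in folds:
            metrics.append(accuracy(y_true, y_pred))
            metrics.append(recall(y_true, y_pred))
    return sum(metrics) / len(metrics)
```

A score near 1 suggests that the classifiers agree with the labels across folds; a low score points to label or feature problems worth inspecting before the data are used further.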
It should be noted that auditing data quality further comprises constructing a data set $D$ containing noisy labels, expressed as:
$$D = \{(x_i, \tilde{y}_i)\}_{i=1}^{N},$$
where $x_i$ denotes the industrial data features of the $i$-th sample, $\tilde{y}_i$ denotes the noisy label of the $i$-th sample, and $N$ denotes the total number of samples. Prediction is performed on the noisy data $D$ to obtain the noise prediction probability of each sample, expressed as:
$$\hat{p}(\tilde{y}_i \mid x_i; \theta) = P(\tilde{y}_i \mid x_i; \theta),$$
where $\hat{p}(\tilde{y}_i \mid x_i; \theta)$ denotes the predicted probability of the noisy label $\tilde{y}_i$ of the $i$-th sample $x_i$ under the given model $\theta$, and $P(\tilde{y}_i \mid x_i; \theta)$ is the conditional probability of the noisy label $\tilde{y}_i$ given the sample $x_i$ under the model $\theta$. Based on the predicted probabilities, the joint confidence of the noisy label $\tilde{y}$ and the true label $y^*$ is calculated as a confident joint matrix $C_{\tilde{y}, y^*}$. Its definition uses the Dirac delta function $\delta$, which takes the value 1 when its condition holds and 0 otherwise, and $y^*_k$, which denotes the $k$-th true label.
The label combinations are counted and normalized to obtain a joint probability distribution: the normalized joint probability matrix $Q_{\tilde{y}, y^*}$ is obtained by dividing the entries of $C_{\tilde{y}, y^*}$ by the sum of the confidences of all true labels in the confident joint matrix, which serves as the normalization term.
Using the Cleanlab pruning function, data pruning is performed with the confident joint matrix $C_{\tilde{y}, y^*}$ and the normalized count data $Q_{\tilde{y}, y^*}$ to remove noisy data and obtain clean data. The confidence of each sample is calculated and the data are pruned according to that confidence, expressed as:
$$c_i = \frac{\hat{p}(\tilde{y}_i \mid x_i; \theta)}{\sum_{j=1}^{N} \hat{p}(\tilde{y}_i \mid x_j; \theta)},$$
where $c_i$ denotes the confidence of the $i$-th sample and the denominator is the sum of the predicted probabilities of all samples under the noisy label $\tilde{y}_i$. A pruning threshold is set, and samples whose confidence is below the threshold are removed to obtain the clean data, expressed as:
$$D_{\text{clean}} = \{(x_i, \tilde{y}_i) \in D \mid c_i \ge \tau\},$$
where $D_{\text{clean}}$ denotes the clean data set and $\tau$ denotes the set pruning threshold.
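The noisy-label steps above can be sketched in plain Python. The sketch does not call Cleanlab, and it rests on assumptions: the confident joint is built with the per-class threshold rule commonly used in confident learning (the mean predicted probability of a class among samples carrying that noisy label), the joint is normalized by its total count, and the per-sample confidence divides by the predicted probability of the sample's noisy label summed over all samples. The names `confident_joint`, `normalize`, and `prune` are hypothetical.

```python
def confident_joint(labels, probs):
    """Count (noisy label, likely true label) pairs.

    labels: noisy integer labels in 0..K-1.
    probs:  one list of K predicted probabilities per sample.
    """
    k = len(probs[0])
    # Per-class threshold: mean probability of class j among samples labeled j.
    thresholds = []
    for j in range(k):
        own = [p[j] for y, p in zip(labels, probs) if y == j]
        thresholds.append(sum(own) / len(own) if own else 1.0)
    joint = [[0] * k for _ in range(k)]
    for y, p in zip(labels, probs):
        confident = [j for j in range(k) if p[j] >= thresholds[j]]
        if confident:
            # Most probable class among those that clear their threshold.
            likely_true = max(confident, key=lambda j: p[j])
            joint[y][likely_true] += 1
    return joint


def normalize(joint):
    """Divide every count by the total so the matrix sums to 1 (assumed normalization)."""
    total = sum(map(sum, joint))
    return [[c / total for c in row] for row in joint] if total else joint


def prune(labels, probs, tau):
    """Return indices of samples whose normalized confidence is at least tau."""
    k = len(probs[0])
    totals = [sum(p[j] for p in probs) for j in range(k)]
    keep = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        conf = p[y] / totals[y] if totals[y] else 0.0
        if conf >= tau:
            keep.append(i)
    return keep
```

Because each confidence is divided by a sum over all samples, the values shrink as the data set grows, so the threshold `tau` has to be chosen relative to the data set size rather than as a fixed probability.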
In the prior art, data quality inspection relies mainly on manual work, which is inefficient and error-prone. Automatically generating the mapping between the data standards and the data model, and configuring and executing a quality inspection scheme, greatly improves the efficiency and accuracy of data quality inspection, reduces manual intervention, and ensures the consistency and reliability of data quality. Executing automated quality inspection rules also provides real-time monitoring of and feedback on data quality, so that data problems are found and resolved promptly.
S3: Split the audited data, extract correct data into the standard library and tag it with data labels to form data assets, and flow data back to its source to form a complete closed loop of data governance.
Further, forming the data asset comprises shunting the audit data, archiving the abnormal data into an error data table, triggering a data quality correction flow, generating a data quality correction work order, extracting correct data into a standard library, and marking a data label to form the data asset.
It should be noted that forming the complete closed loop of data governance includes creating an error data archive and a quality inspection report, analyzing the quality problems in the report, separating the individual data items from the problem data set, forming separate problem-data work orders according to data source, pushing them to the data source systems, and tracking the repair of the source data quality problems. After a repair is completed, the data again enters the collection library from the front-end processor through data aggregation, and data quality management is repeated.
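The splitting and work-order steps can be illustrated with a short sketch. All names here (`split_audited`, `build_work_orders`, the `source` field and the sample records) are assumptions made for illustration; the sketch only shows routing records into a standard library or an error table and grouping the errors by data source.

```python
from collections import defaultdict


def split_audited(records, failures, label):
    """Route each audited record to the standard library or the error table.

    failures[i] is the list of rule names that records[i] failed.
    """
    standard_library, error_table = [], []
    for record, failed in zip(records, failures):
        if failed:
            error_table.append({**record, "failed_rules": list(failed)})
        else:
            standard_library.append({**record, "label": label})
    return standard_library, error_table


def build_work_orders(error_table):
    """Group erroneous records by data source, one work order per source."""
    orders = defaultdict(list)
    for row in error_table:
        orders[row["source"]].append(row)
    return dict(orders)


# Hypothetical audited records from two source systems.
records = [{"id": 1, "source": "plant_a"},
           {"id": 2, "source": "plant_b"},
           {"id": 3, "source": "plant_b"}]
failures = [[], ["null:device_id"], ["range:temp_c"]]
standard, errors = split_audited(records, failures, label="production")
orders = build_work_orders(errors)
```

Each work order would then be pushed to its source system, and repaired data would re-enter the collection library for another audit pass.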
It should be further noted that data splitting and data backflow together realize a complete closed loop of data governance and ensure continuous improvement and optimization of data quality. The prior art does not achieve continuous improvement of data quality through data governance. By splitting the audited data, extracting correct data to the standard library and marking it with data labels to form data assets, and returning data to the data source, the method addresses this problem: it improves the efficiency and effect of data governance, establishes a mechanism for continuous improvement of data quality, raises the utilization value and reliability of the data through data assetization and labeling, and provides high-quality data support for data-driven enterprise decisions.
Embodiment 2 of the present invention provides an industrial data dynamic arrangement quality detection method. To verify the beneficial effects of the invention, it is demonstrated through economic benefit calculation and simulation experiments.
First, four classes of data standards (technical, business, management and quality) and a root word library are established according to business classification, and the mapping relation between the data standards and the metadata is established through the intelligent mapping function. When a standard is recommended for different metadata and two or more metadata items have the same Chinese name, matching is performed. The matching rule matches metadata with the same Chinese name; for the English name, the English name of the first matched metadata item is selected as the unified English name; as an exclusion condition, no matching is performed if the English names are the same but contain no root word.
When matching Chinese root words and synonyms, the metadata is preliminarily screened by its Chinese root words and synonyms. When two or more metadata items are matched, the selected metadata is considered matched, and the English name of the matched metadata is formed directly from the English combination of the root words, separated by underscores. When matching English root words and English abbreviations, the metadata is preliminarily screened by its English root words and abbreviations. When two or more metadata items are matched, the selected metadata is considered matched, and the Chinese name of the matched metadata is formed directly from the Chinese combination of the root words, without separators.
Next, a data quality model and a data quality rule base are established according to the data characteristics, and quality inspection rules are generated automatically from the data standards. The quality inspection rules include null value inspection, value range inspection, specification inspection, repeatability inspection, timeliness inspection, reference value inspection and logic inspection. The quality inspection scheme freely combines these rules and sets the data period, data labels and operation strategy for executing the scheme. The data to be inspected is input into pre-trained classifier models, each of which classifies the input data; the accuracy and recall of the different classifiers on the data set are evaluated by cross-validation, and the mean of the accuracy and recall of the different classifiers is used as the health state evaluation score of the training data.
Finally, the audited data is split: abnormal data is archived to the error data table, a data quality rectification process is triggered and a data quality rectification work order is generated, while correct data is extracted to the standard library and marked with data labels to form data assets. The data flows back to the data source to form a complete closed loop of data governance, which includes creating an error data archive and a quality inspection report, analyzing the quality problems in the report, separating the individual data items from the problem data set, forming separate problem-data work orders according to data source, pushing them to the data source systems, and tracking the repair of the source data quality problems; after a repair is completed, the data again enters the collection library from the front-end processor through data aggregation, and data quality management is repeated. The experimental data is recorded in Table 1.
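The two naming rules (English names joined by underscores, Chinese names concatenated without separators) can be sketched as follows. The root library contents, the greedy longest-match segmentation and the function names are assumptions made for illustration, and the sketch leaves out the preliminary screening step that requires at least two matched metadata items.

```python
# Hypothetical root word library: Chinese root -> English root.
ROOTS = {"设备": "device", "温度": "temperature", "编号": "id"}


def split_roots(chinese_name, roots):
    """Segment a Chinese name into known roots (greedy longest match).

    Returns None when the name cannot be fully covered by known roots.
    """
    ordered = sorted(roots, key=len, reverse=True)
    found, position = [], 0
    while position < len(chinese_name):
        for root in ordered:
            if chinese_name.startswith(root, position):
                found.append(root)
                position += len(root)
                break
        else:
            return None  # no root matches at this position
    return found


def english_name(chinese_name, roots):
    """English name: English roots joined by underscores."""
    parts = split_roots(chinese_name, roots)
    if parts is None:
        return None
    return "_".join(roots[part] for part in parts)


def chinese_name(english, roots):
    """Chinese name: Chinese roots concatenated without separators."""
    reverse = {en: zh for zh, en in roots.items()}
    parts = english.split("_")
    if any(part not in reverse for part in parts):
        return None
    return "".join(reverse[part] for part in parts)


name_en = english_name("设备温度", ROOTS)
name_zh = chinese_name("device_id", ROOTS)
unmatched = english_name("设备压力", ROOTS)
```

Returning `None` for names that are not fully covered by the root library keeps unmatched metadata out of the automatic mapping so that it can be reviewed separately.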
Table 1: Experimental data records (table not reproduced in the source text)
Table 1 records the standard conformance rate of each city's data. The data shows that, after the industrial data standard is formulated and the standard-compliance metadata inspection is carried out, the standard conformance rate of every city's data exceeds 85 percent, with the highest reaching 98 percent. This indicates that the method improves the degree of data standardization and helps ensure the consistency and normalization of the data. The health state evaluation score is calculated from the accuracy and recall of the different classifiers; the data shows that the score of every city's data exceeds 0.82 and reaches 0.95 at most. This indicates that configuring the data quality model and quality inspection rules, automatically generating the mapping relation between the data standard and the data model, and configuring and executing the quality inspection scheme effectively improve the accuracy and efficiency of data quality inspection. The data of all test objects is standardized to form data assets, which shows that splitting the audited data, extracting correct data to the standard library and marking it with data labels, and returning data to the data source realize a closed loop of data governance and ensure continuous improvement of data quality.
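The health state evaluation score, taken as the mean of the accuracy and recall of several classifiers, can be illustrated with a small sketch. It assumes binary labels and that each classifier's predictions (for example from cross-validation) are already available; the classifier names and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted positive."""
    predicted_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predicted_for_positives:
        return 0.0
    hits = sum(p == positive for p in predicted_for_positives)
    return hits / len(predicted_for_positives)


def health_score(y_true, predictions_by_classifier):
    """Mean of accuracy and recall over all classifiers."""
    metrics = []
    for y_pred in predictions_by_classifier.values():
        metrics.append(accuracy(y_true, y_pred))
        metrics.append(recall(y_true, y_pred))
    return sum(metrics) / len(metrics)


# Hypothetical labels and predictions from two classifiers.
y_true = [1, 1, 0, 0]
predictions = {"clf_a": [1, 1, 0, 0],   # accuracy 1.0, recall 1.0
               "clf_b": [1, 0, 0, 1]}   # accuracy 0.5, recall 0.5
score = health_score(y_true, predictions)
```

For these toy inputs the score is (1.0 + 1.0 + 0.5 + 0.5) / 4 = 0.75.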
Embodiment 3: referring to FIG. 2, an embodiment of the present invention provides an industrial data dynamic arrangement quality detection system, which includes a data standardization module, a quality audit module and a data splitting module.
The data standardization module is used for formulating the industrial data standard, performing the standard-compliance metadata inspection, and collecting the data of each city into the collection library according to the different data types and integration modes. The quality audit module is used for configuring the data quality model and the quality inspection rules, automatically generating the mapping relation between the data standard and the data model, and configuring and executing the quality inspection scheme to perform the data quality audit. The data splitting module is used for splitting the audited data, extracting correct data to the standard library and marking it with data labels to form data assets, and returning the data to the data source to form a complete closed loop of data governance.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium include an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with appropriate combinational logic gates, Programmable Gate Arrays (PGAs), Field Programmable Gate Arrays (FPGAs), and the like. It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the present invention may be modified or equivalently substituted without departing from its spirit and scope, and such modifications are intended to be covered by the claims of the present invention.

Claims (7)

1. An industrial data dynamic arrangement quality detection method, characterized by comprising:
formulating an industrial data standard, performing a standard-compliance metadata inspection, and collecting the data of each city into a collection library according to the different data types and integration modes;
configuring a data quality model and quality inspection rules, automatically generating the mapping relation between the data standard and the data model, and configuring and executing a quality inspection scheme to perform a data quality audit;
splitting the audited data, extracting correct data to a standard library and marking it with data labels to form data assets, and returning the data to the data source to form a complete closed loop of data governance;
wherein formulating the industrial data standard comprises establishing four classes of data standards (technical, business, management and quality) and a root word library according to business classification, and establishing the mapping relation between the data standards and metadata through an intelligent mapping function;
when a standard is recommended for different metadata and two or more metadata items have the same Chinese name, matching is performed, wherein the matching rule comprises matching metadata with the same Chinese name, the English name selection comprises selecting, among the matched metadata, the English name of the first metadata item as the unified English name, and the exclusion condition comprises performing no matching if the English names are the same but contain no root word;
when matching Chinese root words and synonyms, the metadata is preliminarily screened by matching its Chinese root words and synonyms; when the number of matched metadata items is greater than or equal to 2, the selected metadata is considered matched, and for the matched metadata the English name directly uses the English combination of the root words, with the root words separated by underscores;
when matching English root words and English abbreviations, the metadata is preliminarily screened by matching its English root words and abbreviations; if the number of matched metadata items is greater than or equal to 2, the selected metadata is considered matched, and for the matched metadata the Chinese name directly uses the Chinese combination of the root words, with no separator between the root words;
formulating the industrial data standard further comprises analyzing the 10 metadata items with the highest comprehensive criticality and selecting the metadata item with the highest criticality for matching, the criticality being calculated as:
$$Key(x) = 0.4 \cdot Sim(x,y) + 0.25 \cdot RootCount(x) + 0.35 \cdot Rel(x,y)$$
where $Key(x)$ denotes the criticality evaluation function, $Sim(x,y)$ denotes the matching-degree function, $RootCount(x)$ denotes the matched-root-count function, and $Rel(x,y)$ denotes the association-degree function;
the matching-degree function measures the degree of similarity between two data objects and is calculated from $C_{match}$, the number of matched Chinese root characters, $C_{total}$, the number of Chinese characters, $E_{match}$, the number of matched English root characters, and $E_{total}$, the number of English characters;
after the data standard is formulated, the standard-compliance metadata is inspected and evaluated based on the number of matched root words and the degree of association to verify the quality of the data; the matched-root-count function is calculated from $C_{min}$ and $C_{max}$, the minimum and maximum numbers of Chinese matches, and $E_{min}$ and $E_{max}$, the minimum and maximum numbers of English matches; the association-degree function is calculated from $R$, the association count, and $R_{min}$ and $R_{max}$, the minimum and maximum association counts;
performing the data quality audit comprises establishing a data quality model and a data quality rule base according to the data characteristics and automatically generating quality inspection rules from the data standards, the quality inspection rules comprising quality audit rules for null value inspection, value range inspection, specification inspection, repeatability inspection, timeliness inspection, reference value inspection and logic inspection, and the quality inspection scheme comprising freely combining the quality audit rules and setting the data period, data labels and operation strategy for executing the scheme;
the data to be inspected is input into pre-trained classifier models, each classifier model classifies the input data, the accuracy and recall of the different classifiers on the data set are evaluated by cross-validation, and the mean of the accuracy and recall of the different classifiers is used as the health state evaluation score of the training data.

2. The industrial data dynamic arrangement quality detection method according to claim 1, characterized in that performing the data quality audit further comprises constructing a data set $X$ containing noise labels, expressed as:
$$X = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$$
where $x_i$ denotes the industrial data features of the $i$-th sample, $\tilde{y}_i$ denotes the noise label of the $i$-th sample, and $n$ denotes the total number of samples;
prediction is performed on the noisy data $X$ to obtain the noise prediction probability of each sample, expressed as:
$$\hat{p}(\tilde{y}_i; x_i, \theta) = P(\tilde{y}_i \mid x_i; \theta)$$
where $\hat{p}(\tilde{y}_i; x_i, \theta)$ denotes the predicted probability of the noise label $\tilde{y}_i$ of the $i$-th sample $x_i$ under a given model $\theta$, and $P(\tilde{y}_i \mid x_i; \theta)$ is the conditional probability of the noise label $\tilde{y}_i$ given the sample $x_i$ under the model $\theta$;
based on the predicted probabilities, the joint probability distribution of the noise label $\tilde{y}$ and the true label $y^*$ is calculated as the confident joint matrix $C_{\tilde{y},y^*}$, whose entries are the joint confidences of the noise label $\tilde{y}$ and the true label $y^*$; the calculation uses the Dirac delta function $\delta$, which takes the value 1 when its condition holds and 0 otherwise, and $y_j$ denotes the $j$-th true label;
the label combinations are counted and normalized to obtain the joint probability distribution, expressed as:
$$\hat{Q}_{\tilde{y},y^*} = \frac{C_{\tilde{y},y^*}}{\sum_{\tilde{y}}\sum_{y^*} C_{\tilde{y},y^*}}$$
where $\hat{Q}_{\tilde{y},y^*}$ denotes the normalized joint probability matrix and the denominator is the sum of the confidences in the confident joint matrix over all labels, used for normalization;
using the pruning function of Cleanlab, data pruning is performed according to the confident joint matrix $C_{\tilde{y},y^*}$ and the normalized count data $\hat{Q}_{\tilde{y},y^*}$ to remove noisy data and obtain clean data; the confidence of each sample is calculated and pruning is performed according to the confidence, expressed as:
$$c_i = \frac{\hat{p}(\tilde{y}_i; x_i, \theta)}{\sum_{j=1}^{n} \hat{p}(\tilde{y}_i; x_j, \theta)}$$
where $c_i$ denotes the confidence of the $i$-th sample and the denominator is the sum of the predicted probabilities of all samples under the noise label $\tilde{y}_i$; a pruning threshold is set, and samples whose confidence is below the threshold are removed to obtain the clean data, expressed as:
$$X_{clean} = \{\, x_i \mid c_i \ge \tau \,\}$$
where $X_{clean}$ denotes the clean data set and $\tau$ denotes the set pruning threshold.

3. The industrial data dynamic arrangement quality detection method according to claim 2, characterized in that forming the data assets comprises splitting the audited data, archiving abnormal data to an error data table, triggering a data quality rectification process and generating a data quality rectification work order, and extracting correct data to the standard library and marking it with data labels to form the data assets.

4. The industrial data dynamic arrangement quality detection method according to claim 3, characterized in that forming the complete closed loop of data governance comprises creating an error data archive and a quality inspection report, analyzing the quality problems in the report, separating the individual data items from the problem data set, forming separate problem-data work orders according to data source, pushing them to the data source systems, and tracking the repair of the source data quality problems; after the repair is completed, the data again enters the collection library from the front-end processor through data aggregation, and data quality management is repeated.

5. A system using the industrial data dynamic arrangement quality detection method according to any one of claims 1 to 4, characterized by comprising a data standardization module, a quality audit module and a data splitting module;
the data standardization module is used for formulating the industrial data standard, performing the standard-compliance metadata inspection, and collecting the data of each city into the collection library according to the different data types and integration modes;
the quality audit module is used for configuring the data quality model and the quality inspection rules, automatically generating the mapping relation between the data standard and the data model, and configuring and executing the quality inspection scheme to perform the data quality audit;
the data splitting module is used for splitting the audited data, extracting correct data to the standard library and marking it with data labels to form data assets, and returning the data to the data source to form a complete closed loop of data governance.

6. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the industrial data dynamic arrangement quality detection method according to any one of claims 1 to 4.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the industrial data dynamic arrangement quality detection method according to any one of claims 1 to 4.
CN202411313595.6A 2024-09-20 2024-09-20 Industrial data dynamic arrangement quality detection method and system Active CN118838895B (en)


Publications (2)

Publication Number Publication Date
CN118838895A CN118838895A (en) 2024-10-25
CN118838895B true CN118838895B (en) 2025-01-14





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant