CN110968576A - Content correlation-based numerical data consistency cleaning method - Google Patents

Publication number
CN110968576A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911189468.9A
Other languages
Chinese (zh)
Inventor
张健沛
张倩玉
杨静
王勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-11-28
Filing date
2019-11-28
Publication date
2020-04-07
Application filed by Harbin Engineering University
Priority to CN201911189468.9A
Publication of CN110968576A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a content correlation-based numerical data consistency cleaning method comprising: (1) using CNFD to discover and merge data rules; (2) using the CNFD-detect algorithm to detect inconsistent data; (3) using the CNFD-repair algorithm to repair the inconsistent data; and (4) detecting and repairing the repaired data again. The CNFD-detect algorithm detects the inconsistent data, and the method then judges whether rules may deadlock during repair. If so, the rules that may deadlock are grouped together and assigned a uniform value before repair; if not, the data is repaired directly. The repaired data is detected iteratively to resolve new inconsistencies introduced by the repair. The invention shows superiority and good adaptability in data consistency cleaning, in error detection rate, in error repair rate, and in practical application.

Description

Content correlation-based numerical data consistency cleaning method
Technical Field
The invention relates to a method for maintaining the consistency of associated data, and in particular to a consistency cleaning method for numerical data.
Background
With the development of information technology and the spread of computer applications, and especially with advances in network technology such as Web 2.0, people have entered the "big data era". Data quality management is a classic problem in the database field, and data quality remains a topic of great interest in both traditional databases and big data environments. A high-quality database system should provide reliable, maintainable, high-quality data in response to varied query needs, and that data should truly and accurately reflect the actual state of the objective world.
However, real-life data often has quality problems. Such data is commonly called "dirty data": it may be incomplete, inconsistent, redundant, or stale. Dirty data lowers the overall quality of the data and also distorts data analysis (such as data mining and OLAP analysis), eventually leading to wrong decisions and large economic losses. According to a survey by a German professional data analysis agency, losses due to poor-quality data in the United States are as high as $6000 billion per year. It is therefore very important to improve data quality by addressing dirty data.
Data quality management covers five aspects: data consistency, data redundancy, data integrity, data accuracy, and data timeliness. Data consistency describes the availability and correctness of objective entities; data that satisfies consistency avoids data conflicts and semantic ambiguity. Maintaining consistency with rule constraints is the most effective technical means, for example functional dependencies (FD), conditional functional dependencies (CFD), and, for numerical data, numerical functional dependencies (NFD). NFDs can catch errors in numerical data, but some potential errors remain undetectable by NFDs, which lowers the overall quality of the data. A rule constraint that can detect and repair these potential errors is therefore highly desirable.
The content expressed by different data is related, and this association can be used to detect potential errors in the data and then repair them. The association between data is also reflected in the constraint rules that apply to the data, so NFDs are associated with each other as well.
Current research on data quality management of associated data mainly covers error discovery and error repair.
Error discovery means finding data that has quality problems. Researchers in China and abroad have studied error discovery in depth and obtained important results. The best-known and most effective approaches in the existing literature fall into three categories: entity identification, which uses a distance function to attribute data meeting a distance requirement to the corresponding entity; error discovery based on rule constraints, for which researchers have proposed a variety of constraints for different error types, including functional dependencies, conditional functional dependencies, inclusion constraints, currency constraints, and denial constraints; and error discovery based on master data, which matches the data to be processed against the data source using matching rules and then performs detection and repair.
Data repair is the process of repairing the data detected as erroneous. According to the repair strategy, error repair methods fall into three types: rule-based repair, which uses functional dependencies to maintain data consistency, uses a heuristic to match string-type data during repair, and selects the minimum-cost repair strategy; truth discovery, which computes the confidence of data with a Bayesian model and selects the value with the maximum posterior probability as the true value; and machine learning-based repair, in which researchers obtain repair target values by applying existing machine learning methods (such as decision trees, Bayesian networks, and neural networks) to a training set.
Disclosure of Invention
The invention aims to provide a content correlation-based numerical data consistency cleaning method with a high error detection rate and high repair accuracy.
The purpose of the invention is achieved as follows:
(1) using CNFD to discover and merge data rules;
(2) using the CNFD-detect algorithm to detect inconsistent data in the data;
(3) using the CNFD-repair algorithm to repair the inconsistent data in the data;
(4) detecting and repairing the repaired data again.
The invention may also include the following features:
1. In step (1), the data is expressed as an NFD dependency set using an NFD rule design method, and rules with the same dependency format, namely related rules, are merged into a CNFD set.
2. In step (2), the CNFD-detect algorithm computes the tuples in the input relation instance that match the CNFD set and judges the consistency state of the computed instance. If the consistency requirement is satisfied, an empty set is returned; if not, the inconsistent data is located and an inconsistent data set is output.
3. In step (3), the CNFD-repair algorithm takes the inconsistent data set as input. During repair it first detects the rules that may cause a repair deadlock, groups those rules together for detection, and assigns them the same repair target value; erroneous data not involved in a deadlock is repaired directly.
4. In step (4), a CNFD iterative cleaning framework is adopted: the repaired data is detected and repaired iteratively until the error rate falls below a threshold, at which point the process ends.
The invention provides a content correlation-based numerical data consistency cleaning method whose aim is to repair numerical data with higher accuracy. Its main characteristics are: (1) a CNFD-detect algorithm is proposed for detecting data consistency; (2) a CNFD-repair algorithm is proposed for repairing inconsistent data; (3) after repair, detection and repair are performed again iteratively. Inconsistent data is detected with the CNFD-detect algorithm. The method then judges whether rules may deadlock; if so, the rules that may deadlock are grouped together and assigned a uniform value before repair, and if not, the data is repaired directly. The repaired data is detected iteratively to resolve new inconsistencies introduced by the repair. The invention shows superiority and good adaptability in data consistency cleaning, in error detection rate, in error repair rate, and in practical application.
Compared with the prior art, the invention has the following advantages: a. CNFD merges rules, which reduces the number of detection passes and shortens the running time; b. error detection draws on data under other conditions, so the error detection rate is higher; c. error repair achieves higher repair accuracy. The invention therefore cleans data using content-related numerical functional dependencies, achieves higher detection accuracy and a higher error repair rate by examining related content, and extends to content-related numerical data sets.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIGS. 2(a) through 2(c) are run-time comparisons of the present method with conventional methods, FIG. 2(a) being the run time on the digitized Adults dataset, FIG. 2(b) the run time on the digitized Census-Income dataset, and FIG. 2(c) the run time on the hAdults dataset;
FIG. 3 is a comparison of the error detection rates of the present method and the conventional method at different error data ratios;
FIG. 4 is a comparison of the error repair accuracy of the present method and the conventional method at different error data ratios;
FIG. 5 is the process flow that outputs the cleaned relation instance E;
FIG. 6 is the process flow that outputs the repaired instance I′;
FIG. 7 is the process flow that outputs the cleaned relation instance D_c′.
Detailed Description
The content correlation-based numerical data consistency cleaning method of the invention mainly comprises the following steps:
(1) Consistency check (CNFD-detect)
The input is a relation instance and a CNFD set. The tuples in the relation instance that match the CNFD set are computed, and the consistency state of the computed instance is judged. If the consistency requirement is satisfied, an empty set is returned; if not, the inconsistent data is located and an inconsistent data set is output.
(2) Consistency repair (CNFD-repair)
The input is the inconsistent data set. During repair, the rules that may cause a repair deadlock are detected first, grouped together for detection, and assigned the same repair target value; erroneous data not involved in a deadlock is repaired directly.
(3) Iterative detection and repair
The data repair process only repairs the currently detected error and cannot take the global state into account, so a repair result may affect other data and introduce a new consistency conflict. Detection therefore has to be performed again after repair.
The invention is described in more detail below by way of example.
With reference to FIG. 1, the specific steps of the invention are as follows:
(1) Consistency check (CNFD-detect)
After the numerical functional dependency is rewritten in the conditional functional dependency format Y → A, the content-dependent numerical functional dependency is defined as follows.
Definition 1 (content-dependent numerical functional dependency, CNFD). A content-dependent numerical functional dependency Ψ on R is
Ψ: (C | Y → A, S_c) (Definition 1)
where C is the condition attribute set and Y is the variable attribute set; C and Y are separated by "|". [Two formula images stating the constraints on C and Y are missing from this text.] C ∪ Y is the left-hand side of rule Ψ, denoted LHS(Ψ). A is the right-hand side of rule Ψ, with A ∈ attr(R), denoted RHS(Ψ).
A CNFD is obtained by merging NFDs: NFDs that share the same (C | Y → A) form are the candidate rules for merging, and merging them yields a CNFD set.
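The merging step described above can be sketched as a grouping operation. This is only an illustration under assumptions: the `NFD` record, its field names, and the sample rules are hypothetical and do not come from the patent, which does not give the rule representation.

```python
from collections import defaultdict
from typing import NamedTuple


class NFD(NamedTuple):
    # Hypothetical rule record; the field names are illustrative only.
    cond: frozenset   # condition attribute set C
    var: frozenset    # variable attribute set Y
    rhs: str          # right-hand attribute A
    content: str      # rule-specific content (assumed free-form here)


def merge_into_cnfd_sets(rules):
    """Group NFDs that share the same (C | Y -> A) form."""
    groups = defaultdict(list)
    for rule in rules:
        groups[(rule.cond, rule.var, rule.rhs)].append(rule)
    return dict(groups)


rules = [
    NFD(frozenset({"dept"}), frozenset({"hours"}), "pay", "dept = 1"),
    NFD(frozenset({"dept"}), frozenset({"hours"}), "pay", "dept = 2"),
    NFD(frozenset({"dept"}), frozenset({"age"}), "pay", "dept = 1"),
]
cnfd_sets = merge_into_cnfd_sets(rules)
```

Here the first two rules share one form and end up in the same group, while the third forms a group of its own, so a later detection pass could visit each group once rather than each rule once.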
The CNFD-detect algorithm is specifically as follows:
Input: relation instance I
       set of content-dependent numerical functional dependencies Ψ
Output: cleaned relation instance E.
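The algorithm body is not reproduced in this text, so the following is only a simplified sketch of the detection idea, not the patented CNFD-detect algorithm. It assumes each rule can be modelled as a condition predicate plus a numerical check on a single tuple; the `Rule` class, the attribute names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


@dataclass
class Rule:
    # Hypothetical rule model: `condition` plays the role of the C part,
    # `check` the numerical constraint between Y and A.
    condition: Callable[[Row], bool]
    check: Callable[[Row], bool]
    rhs: str


def cnfd_detect(instance: List[Row], cnfd_set: List[Rule]) -> List[Tuple[int, str]]:
    """Return (row index, attribute) pairs that violate a rule; empty if consistent."""
    inconsistent = []
    for idx, row in enumerate(instance):
        for rule in cnfd_set:
            if rule.condition(row) and not rule.check(row):
                inconsistent.append((idx, rule.rhs))
    return inconsistent


# Assumed example rule: for dept 1, pay must lie between 10x and 50x hours.
rule_set = [
    Rule(
        condition=lambda r: r["dept"] == 1,
        check=lambda r: 10 * r["hours"] <= r["pay"] <= 50 * r["hours"],
        rhs="pay",
    )
]
rows = [
    {"dept": 1, "hours": 40, "pay": 800},
    {"dept": 1, "hours": 40, "pay": 5000},
    {"dept": 2, "hours": 40, "pay": 5000},
]
found = cnfd_detect(rows, rule_set)
```

The sketch checks each tuple in isolation. The method described in this document also draws on data under other conditions during detection, which this simplified version does not attempt.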
(2) Consistency repair (CNFD-repair)
The CNFD-repair algorithm is as follows:
Input: inconsistent data set ES
       relation instance I
Output: repaired instance I′.
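The text does not define repair deadlock precisely, so the sketch below rests on an assumed reading: two rules may deadlock when they flag the same cell and would otherwise repair it to different values in turn. Under that assumption, violations on the same cell are grouped and given one shared target value. The violation format (row index, attribute, allowed interval) is hypothetical.

```python
from collections import defaultdict


def cnfd_repair(instance, violations):
    """Repair cells; violations are (row_index, attribute, (low, high)) triples.

    Violations on the same cell are grouped and share one repair target,
    chosen from the intersection of their allowed intervals (an assumption).
    """
    repaired = [dict(row) for row in instance]  # leave the input untouched
    by_cell = defaultdict(list)
    for row_idx, attr, interval in violations:
        by_cell[(row_idx, attr)].append(interval)

    for (row_idx, attr), intervals in by_cell.items():
        low = max(lo for lo, _ in intervals)
        high = min(hi for _, hi in intervals)
        if low > high:
            continue  # rules cannot be satisfied together; cell is left as-is
        value = repaired[row_idx][attr]
        # Clamp to the nearest value inside the shared interval.
        repaired[row_idx][attr] = min(max(value, low), high)
    return repaired


instance = [{"pay": 5000}, {"pay": 100}]
violations = [
    (0, "pay", (400, 2000)),   # two rules flag the same cell of row 0
    (0, "pay", (1000, 3000)),
    (1, "pay", (400, 2000)),   # a single rule flags row 1: direct repair
]
repaired = cnfd_repair(instance, violations)
```

Clamping to the nearest admissible value is one possible choice of target; the document does not state how the shared repair target value is computed.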
(3) Iterative detection and repair
The CNFD iterative cleaning framework algorithm is as follows:
Input: set of content-dependent numerical functional dependencies Σ
Output: cleaned relation instance D_c′.
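The iterative framework can be sketched as a detect-and-repair loop that stops once the error rate falls below a threshold. This is a minimal sketch under assumptions: the `detect` and `repair` callables, the per-tuple definition of error rate, the default threshold, and the `max_rounds` guard are all illustrative choices not stated in the patent.

```python
def iterative_clean(instance, detect, repair, threshold=0.01, max_rounds=10):
    """Repeat detection and repair until the error rate drops below threshold."""
    current = list(instance)
    for _ in range(max_rounds):  # guard against endless repair cycles (assumed)
        errors = detect(current)
        error_rate = len(errors) / max(len(current), 1)
        if error_rate < threshold:
            break
        current = repair(current, errors)
    return current


# Toy stand-ins for CNFD-detect and CNFD-repair (hypothetical).
def detect(data):
    # A value is "inconsistent" when it exceeds 10.
    return [i for i, v in enumerate(data) if v > 10]


def repair(data, errors):
    # Halving may leave a value still inconsistent, forcing another round.
    return [v // 2 if i in errors else v for i, v in enumerate(data)]


cleaned = iterative_clean([4, 40, 12], detect, repair)
```

In this toy run the value 40 needs two repair rounds, which illustrates why a single detect-then-repair pass may not be enough.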
To verify the content correlation-based numerical data consistency cleaning method, two real numerical data sets are used, 1) US census information (Adults) and 2) resident income statistics (Census-Income), together with a manually constructed data set (hAdults), for which a generator randomly selects data and combines it into tuples after the data has been converted to numerical form. FIGS. 2(a) through 2(c) compare the run times of the present method and the conventional method on the Adults, Census-Income, and hAdults data sets. During detection, CNFD merges rules and so reduces the number of detection passes, which makes the detection time relatively short. During repair, CNFD has detected more errors and more reference values need to be considered, so the repair time increases.
FIG. 3 compares the error detection rates of the present method and the conventional method on the three data sets. Both NFD and CNFD have high error detection rates overall. Because CNFD draws on other data and rules during error detection, its detection rate is slightly higher than that of NFD.
FIG. 4 compares the error repair rates of the present method and the conventional method on the three data sets. The error repair rate of CNFD is higher than that of the Voting method among the conventional methods, because CNFD refers to other data, detects more errors, and repairs them more accurately.

Claims (9)

1. A content correlation-based numerical data consistency cleaning method, characterized by comprising:
(1) using CNFD to discover and merge data rules;
(2) using the CNFD-detect algorithm to detect inconsistent data in the data;
(3) using the CNFD-repair algorithm to repair the inconsistent data in the data;
(4) detecting and repairing the repaired data again.
2. The content correlation-based numerical data consistency cleaning method of claim 1, wherein: in step (1), the data is expressed as an NFD dependency set using an NFD rule design method, and rules with the same dependency format, namely related rules, are merged into a CNFD set.
3. The content correlation-based numerical data consistency cleaning method of claim 1 or 2, wherein: in step (2), the CNFD-detect algorithm computes the tuples in the input relation instance that match the CNFD set and judges the consistency state of the computed instance; if the consistency requirement is satisfied, an empty set is returned; if not, the inconsistent data is located and an inconsistent data set is output.
4. The content correlation-based numerical data consistency cleaning method of claim 1 or 2, wherein: in step (3), the CNFD-repair algorithm takes the inconsistent data set as input; during repair it first detects the rules that may cause a repair deadlock, groups those rules together for detection, and assigns them the same repair target value; erroneous data not involved in a deadlock is repaired directly.
5. The content correlation-based numerical data consistency cleaning method of claim 3, wherein: in step (3), the CNFD-repair algorithm takes the inconsistent data set as input; during repair it first detects the rules that may cause a repair deadlock, groups those rules together for detection, and assigns them the same repair target value; erroneous data not involved in a deadlock is repaired directly.
6. The content correlation-based numerical data consistency cleaning method of claim 1 or 2, wherein: in step (4), a CNFD iterative cleaning framework is adopted, and the repaired data is detected and repaired iteratively until the error rate falls below a threshold, at which point the process ends.
7. The content correlation-based numerical data consistency cleaning method of claim 3, wherein: in step (4), a CNFD iterative cleaning framework is adopted, and the repaired data is detected and repaired iteratively until the error rate falls below a threshold, at which point the process ends.
8. The content correlation-based numerical data consistency cleaning method of claim 4, wherein: in step (4), a CNFD iterative cleaning framework is adopted, and the repaired data is detected and repaired iteratively until the error rate falls below a threshold, at which point the process ends.
9. The content correlation-based numerical data consistency cleaning method of claim 5, wherein: in step (4), a CNFD iterative cleaning framework is adopted, and the repaired data is detected and repaired iteratively until the error rate falls below a threshold, at which point the process ends.
CN201911189468.9A 2019-11-28 2019-11-28 Content correlation-based numerical data consistency cleaning method Pending CN110968576A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911189468.9A CN110968576A (en) 2019-11-28 2019-11-28 Content correlation-based numerical data consistency cleaning method

Publications (1)

Publication Number Publication Date
CN110968576A true CN110968576A (en) 2020-04-07

Family

ID=70031956

Country Status (1)

Country Link
CN (1) CN110968576A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239126A (en) * 2021-05-11 2021-08-10 中国银行保险信息技术管理有限公司 Business activity information standardization scheme based on BOR method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514366A (en) * 2013-09-13 2014-01-15 中南大学 Urban air quality concentration monitoring missing data recovering method
US20150254308A1 (en) * 2014-03-10 2015-09-10 Zephyr Health, Inc. Record linkage algorithm for multi-structured data
CN106446091A (en) * 2016-09-13 2017-02-22 北京协力筑成金融信息服务股份有限公司 Preprocessing method and device for multi-source time series data
CN109634949A (en) * 2018-12-28 2019-04-16 浙江大学 A kind of blended data cleaning method based on more versions of data
CN110188103A (en) * 2019-05-27 2019-08-30 深圳乐信软件技术有限公司 Data reconciliation method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xiaoou Ding et al.: Improve3C: Data Cleaning on Consistency and Completeness with Currency *
余敏 et al.: 基于依赖的数据一致性研究进展 *

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200407