Content correlation-based numerical data consistency cleaning method
Technical Field
The invention relates to a data consistency maintenance method of associated data, in particular to a numerical data consistency cleaning method.
Background
With the development of information technology and the spread of computer applications, and especially with advances in network technology such as the emergence of Web 2.0, people have entered the "big data era". Data quality management is a classic problem in the database field, and data quality remains a topic of great interest in both traditional databases and big data environments. A high-quality database system should be able to provide reliable, maintainable, high-quality data according to the needs of various queries, and this data should truly and accurately reflect the actual state of the objective world.
However, real-life data often has quality problems; such data is referred to as "dirty data", e.g., data that is incomplete, inconsistent, redundant, or stale. Dirty data not only lowers the overall quality of the data but also affects data analysis (such as data mining and OLAP analysis), eventually leading to decision errors and great economic loss. According to an investigation by a German professional data analysis agency, losses due to poor-quality data in the United States are as high as $6000 billion per year. Therefore, it is very important to improve data quality by addressing dirty data.
Data quality management covers five aspects: data consistency, data redundancy, data integrity, data accuracy, and data timeliness. Data consistency describes the availability and correctness of objective entities; data that satisfies consistency avoids data conflicts and semantic ambiguity. Using rule constraints is the most effective technical means of maintaining data consistency, for example function dependence (FD), conditional function dependence (CFD), and, for numerical data, numerical function dependence (NFD). NFDs can catch errors in numerical data, but some potential errors cannot be detected by NFDs, which in turn affects the overall quality of the data. A rule constraint that can detect and repair these potential errors is therefore highly desirable.
The content expressed between different data has an association relationship, and the association relationship can be used for detecting potential errors in the data and further repairing the potential errors. The association relationship between the data is also reflected in the constraint rule related to the data, so that the NFDs are also associated with each other.
Current research work on data quality management of associated data mainly includes error discovery and error repair:
Error discovery refers to finding data that has quality problems. Researchers at home and abroad have studied error discovery in depth and obtained important results. The best-known and most effective approaches in the existing literature fall into three classes: (1) entity identification, which uses a distance function to attribute data meeting a distance requirement to the corresponding entity; (2) error discovery based on rule constraints, for which researchers have proposed a variety of constraints for different error types, including function dependence, conditional function dependence, containment constraints, currency constraints, denial constraints, and the like; and (3) error discovery based on master data, which matches the data to be processed against the master data source using matching rules and then carries out data detection and repair.
Data repair refers to the process of repairing the erroneous data that has been detected. Error repair methods can be classified into three types according to the repair strategy: (1) repair based on rule constraints, which uses function dependence to maintain data consistency, matches string-type data heuristically during repair, and selects the repair with the minimum cost; (2) truth discovery, which computes the confidence of the data with a Bayesian model and selects the value with the maximum posterior probability as the true value; and (3) repair based on machine learning, in which researchers obtain repair target values by applying existing machine learning methods (such as decision trees, Bayesian networks, and neural networks) to a training set.
Disclosure of Invention
The invention aims to provide a content correlation-based numerical data consistency cleaning method that has a high error detection rate and a high correction accuracy.
The purpose of the invention is realized as follows:
(1) using CNFD to discover and combine the data rules;
(2) using the CNFD-detect algorithm to detect inconsistent data in the data;
(3) using the CNFD-repair algorithm to repair the inconsistent data;
(4) carrying out detection and repair again on the repaired data.
The invention may also include such features:
1. In step (1), the data is expressed as an NFD dependency set using the NFD rule design method, and rules with the same dependency format, namely related rules, are combined into a CNFD set.
2. In step (2), the CNFD-detect algorithm computes the tuples in the input relation instance that match the CNFD set and judges the consistency of the computed instance; if the consistency requirement is satisfied, an empty set is returned; if not, the inconsistent data is located and an inconsistent data set is output.
3. In step (3), the CNFD-repair algorithm takes the inconsistent data set as input; during repair, it first detects rules that may cause repair deadlock, puts those rules together for detection, and assigns them the same repair target value; erroneous data without deadlock is repaired directly.
4. In step (4), a CNFD iterative cleaning framework is adopted, and detection and repair are performed iteratively on the repaired data until the error rate falls below a threshold, at which point the process ends.
The invention provides a numerical data consistency cleaning method based on content correlation, aiming to repair numerical data with higher accuracy. Its main features are: (1) a CNFD-detect algorithm is provided for detecting data consistency; (2) a CNFD-repair algorithm is provided for repairing inconsistent data; and (3) after repair, detection and repair are performed again iteratively. Inconsistent data is detected with the CNFD-detect algorithm, and the rules are then checked for possible deadlock. Rules that may deadlock are put together, assigned a uniform value, and then repaired; otherwise the data is repaired directly. The repaired data is then detected iteratively to handle any new inconsistency generated by the repair. The invention shows superiority and good adaptability in data consistency cleaning, in terms of error detection rate, error repair rate, and practical application.
Compared with the prior art, the invention has the following advantages: a. the CNFD combines the rules, so that the number of detections is reduced and the running time is shorter; b. when error detection is carried out on data, data under other conditions is also used for detection, so the error detection rate is higher; c. the correction accuracy is higher when error repair is carried out. The invention therefore performs data cleaning using content-related numerical function dependence, achieves higher detection accuracy and error repair rate by examining related content, and is extensible to content-related numerical data sets.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIGS. 2(a) through 2(c) are run-time comparisons of the present method with conventional methods: FIG. 2(a) is the run time on the numerically encoded Adults dataset, FIG. 2(b) is the run time on the numerically encoded Census-Income dataset, and FIG. 2(c) is the run time on the hAdults dataset;
FIG. 3 is a comparison of error detection rates of the present method and the conventional method with respect to error data ratios;
FIG. 4 is a comparison of error repair accuracy for the error data ratio between the present method and the conventional method;
FIG. 5 is the process flow that produces the cleaned relation instance E;
FIG. 6 is the process flow that produces the repaired instance I′;
FIG. 7 is the process flow that produces the cleaned relation instance Dc′.
Detailed Description
The invention relates to a content-correlation-based numerical data consistency cleaning method, which mainly comprises the following steps:
(1) Consistency check (CNFD-detect)
The input is a relation instance and a CNFD set. The tuples in the relation instance that match the CNFD set are computed, and the consistency of the computed instance is judged; if the consistency requirement is satisfied, an empty set is returned; if not, the inconsistent data is located and an inconsistent data set is output.
(2) Consistency repair (CNFD-repair)
The input is an inconsistent data set. During repair, rules that may cause repair deadlock in the data are detected first; these rules are put together for detection and assigned the same repair target value. Erroneous data without deadlock is repaired directly.
(3) Iterative detection and repair
The data repair process repairs only the currently detected errors and cannot take the global state into account, so a repair result may affect other data and cause new inconsistencies. The repaired data therefore needs to be detected again after repair.
The invention is described in more detail below by way of example.
With reference to fig. 1, the specific steps of the present invention are as follows:
(1) Consistency check (CNFD-detect)
After the numerical function dependence is rewritten into the conditional function dependence format Y → A, the content-dependent numerical function dependence is defined as follows.
Definition 1. Content-dependent numerical function dependence (CNFD).
A content-dependent numerical function dependence Ψ on a relation schema R has the form
Ψ: (C | Y → A, Sc) (Definition 1)
where C is the condition attribute set and Y is the variable attribute set; C and Y are separated by "|".
C ∪ Y is the left part of rule Ψ, denoted LHS(Ψ). A is the right part of rule Ψ, with A ∈ attr(R), denoted RHS(Ψ).
Content-dependent numerical function dependencies are obtained by combining numerical function dependencies: numerical function dependencies that share the same (C | Y → A) form are the candidate rules for combination, and combining them yields a CNFD set.
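The combination of rules sharing one (C | Y → A) form can be illustrated with a minimal Python sketch. It is not the algorithm of the invention: the `NFD` class, the `combine_into_cnfds` function, and the attribute names are hypothetical, and each rule is reduced to its attribute sets, since the text does not specify what else a rule carries.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class NFD:
    """Hypothetical stand-in for one numerical function dependence."""
    name: str
    cond_attrs: FrozenSet[str]  # C: condition attribute set
    var_attrs: FrozenSet[str]   # Y: variable attribute set
    rhs: str                    # A: right-hand-side attribute


Signature = Tuple[FrozenSet[str], FrozenSet[str], str]


def combine_into_cnfds(nfds: List[NFD]) -> Dict[Signature, List[NFD]]:
    """Group NFDs that share the same (C | Y -> A) form (assumed grouping key)."""
    groups: Dict[Signature, List[NFD]] = defaultdict(list)
    for nfd in nfds:
        groups[(nfd.cond_attrs, nfd.var_attrs, nfd.rhs)].append(nfd)
    return dict(groups)


# Illustrative rules only; the attribute names are invented for this sketch.
rules = [
    NFD("r1", frozenset({"country"}), frozenset({"hours"}), "income"),
    NFD("r2", frozenset({"country"}), frozenset({"hours"}), "income"),
    NFD("r3", frozenset({"country"}), frozenset({"age"}), "income"),
]
cnfds = combine_into_cnfds(rules)
```

Here r1 and r2 share one form and are checked together, while r3 stays in a group of its own, which is how combining rules can reduce the number of detection passes.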
The CNFD-detect algorithm is as follows:
Input: relation instance I;
CNFD set Ψ.
Output: relation instance E after cleaning.
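The detection step (an empty set when the instance is consistent, otherwise the located inconsistent data) can be sketched in Python as follows. The rule semantics are an assumption made for illustration: each hypothetical `Rule` holds a condition on the C attributes and a callable giving the expected value of A from the Y attributes, with a numeric tolerance. The source does not define how a CNFD is evaluated, so this is not the CNFD-detect algorithm itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, float]


@dataclass
class Rule:
    """Hypothetical rule: when `condition` holds, A should equal `expected(row)`."""
    name: str
    condition: Callable[[Row], bool]  # test on the condition attributes C
    expected: Callable[[Row], float]  # value of A implied by the attributes Y
    rhs: str                          # attribute A
    tolerance: float = 1e-9           # assumed numeric tolerance


def cnfd_detect(instance: List[Row], rules: List[Rule]) -> Set[Tuple[int, str, str]]:
    """Return (row index, attribute, rule name) per violation; empty set if consistent."""
    inconsistent: Set[Tuple[int, str, str]] = set()
    for i, row in enumerate(instance):
        for rule in rules:
            if not rule.condition(row):
                continue  # the rule does not apply to this tuple
            if abs(row[rule.rhs] - rule.expected(row)) > rule.tolerance:
                inconsistent.add((i, rule.rhs, rule.name))
    return inconsistent


# Illustrative rule: for region 1, total should equal qty * price.
total_rule = Rule(
    name="total_rule",
    condition=lambda r: r["region"] == 1,
    expected=lambda r: r["qty"] * r["price"],
    rhs="total",
)
instance = [
    {"region": 1, "qty": 2, "price": 3.0, "total": 6.0},
    {"region": 1, "qty": 4, "price": 2.5, "total": 11.0},  # expected 10.0
    {"region": 2, "qty": 1, "price": 9.0, "total": 0.0},   # rule not applicable
]
errors = cnfd_detect(instance, [total_rule])
```

Only the second tuple is reported, because the third does not meet the condition on C and is therefore not checked by this rule.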
(2) Consistency repair (CNFD-repair)
The CNFD-repair algorithm is as follows:
Input: inconsistent data set ES;
relation instance I.
Output: repaired instance I′.
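The repair strategy (group rules that may deadlock and give them one repair target, repair everything else directly) can be sketched as below. Several points are assumptions not stated in the source: a possible deadlock is taken to mean that different rules propose different values for the same cell, the shared target is taken to be the median of the proposals, and each entry of the inconsistent data set is assumed to carry a proposed value. The function name `cnfd_repair` is likewise illustrative.

```python
import copy
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple

Row = Dict[str, float]
# Assumed entry layout: (row index, attribute, proposed value, rule name)
Violation = Tuple[int, str, float, str]


def cnfd_repair(es: List[Violation], instance: List[Row]) -> List[Row]:
    """Return a repaired copy of `instance`; the input is left unchanged."""
    repaired = copy.deepcopy(instance)

    # Collect every proposed value per cell so conflicting rules are seen together.
    proposals: Dict[Tuple[int, str], List[float]] = defaultdict(list)
    for row_idx, attr, value, _rule in es:
        proposals[(row_idx, attr)].append(value)

    for (row_idx, attr), values in proposals.items():
        if len(set(values)) > 1:
            # Possible deadlock: the rules disagree, so applying them one by one
            # could undo each other. Assign one shared target (assumed: median).
            target = median(values)
        else:
            # No deadlock: repair directly with the single proposed value.
            target = values[0]
        repaired[row_idx][attr] = target
    return repaired


instance = [{"total": 20.0}, {"total": 5.0}]
es = [
    (0, "total", 10.0, "r1"),
    (0, "total", 14.0, "r2"),  # conflicts with r1 on the same cell
    (1, "total", 6.0, "r1"),
]
repaired = cnfd_repair(es, instance)
```

Giving the conflicting rules a single target keeps them from overwriting each other's repairs in turn; any other aggregation of the proposals could be substituted for the median.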
(3) Iterative detection and repair
The CNFD iterative cleaning framework algorithm is as follows:
Input: CNFD set Σ.
Output: cleaned relation instance Dc′.
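The iterative framework (detect, repair, and detect again until the error rate falls below a threshold) can be sketched as a simple loop. The definition of the error rate as violations per tuple, the `max_rounds` safety bound, and the toy rules b = a + 1 and c = 2 * b are assumptions made only for this illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Error = Tuple[int, str, float]  # (row index, attribute, expected value)


def iterative_clean(
    instance: List[Row],
    detect: Callable[[List[Row]], List[Error]],
    repair: Callable[[List[Error], List[Row]], List[Row]],
    threshold: float,
    max_rounds: int = 10,  # assumed safety bound, not part of the source
) -> List[Row]:
    """Alternate detection and repair until the error rate is below `threshold`."""
    current = instance
    for _ in range(max_rounds):
        errors = detect(current)
        error_rate = len(errors) / max(len(current), 1)  # assumed definition
        if error_rate < threshold:
            break
        current = repair(errors, current)
    return current


def toy_detect(rows: List[Row]) -> List[Error]:
    """Toy rules (hypothetical): b = a + 1 and c = 2 * b."""
    errors: List[Error] = []
    for i, r in enumerate(rows):
        if r["b"] != r["a"] + 1:
            errors.append((i, "b", r["a"] + 1))
        if r["c"] != 2 * r["b"]:
            errors.append((i, "c", 2 * r["b"]))
    return errors


def toy_repair(errors: List[Error], rows: List[Row]) -> List[Row]:
    """Overwrite each reported cell with its expected value."""
    fixed = [dict(r) for r in rows]
    for i, attr, value in errors:
        fixed[i][attr] = value
    return fixed


cleaned = iterative_clean(
    [{"a": 1, "b": 5, "c": 10}], toy_detect, toy_repair, threshold=0.5
)
```

In this toy run the first round repairs b, which makes the previously consistent c violate its rule; the second round repairs c, and the third round finds no errors and stops. This is the kind of repair-induced inconsistency that re-detection is meant to catch.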
To verify the content correlation-based numerical data consistency cleaning method, three data sets are used: two real numerical data sets, 1) American census information (Adults) and 2) resident income statistics (Census-Income), and a manually constructed data set (hAdults), for which a generator randomly selects data and combines it into tuples after the data is numerically encoded. FIG. 2 shows a comparison of the run times of the present method and the conventional method on the Adults, Census-Income, and hAdults datasets. During detection, CNFD combines the rules and thereby reduces the number of detections, so the detection time is relatively short. During repair, CNFD detects more errors and more reference values need to be considered, so the repair time increases.
FIG. 3 shows a comparison of the error detection rates of the present method and the conventional method on the three data sets. Both NFD and CNFD achieve high error detection rates overall. Because CNFD also uses other data and rules during error detection, its detection rate is slightly higher than that of NFD.
FIG. 4 shows a comparison of the error repair accuracy of the present method and the conventional method on the three data sets. The error repair accuracy of CNFD is higher than that of the Voting method among the conventional methods because CNFD refers to other data, detects more errors, and repairs them more accurately.