CN110825788A

CN110825788A - Rule reduction method based on data quality detection rule mining result

Info

Publication number: CN110825788A
Application number: CN201911079988.4A
Authority: CN
Inventors: 唐雪飞; 黄永鑫
Original assignee: CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd
Current assignee: CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd
Priority date: 2019-11-07
Filing date: 2019-11-07
Publication date: 2020-02-21

Abstract

The invention discloses a rule reduction method based on a data quality detection rule mining result, comprising the following steps: S1, mining the data quality detection rules; S2, performing rule reduction on the mining result, comprising the following sub-steps: S21, extracting all attributes of the rules obtained in step S1 and taking each attribute as a vertex of a graph G; S22, drawing the edges of graph G according to the dependency relationships among the attributes and assigning a weight of 1 to each edge; S23, obtaining the shortest paths of all vertices in graph G using the Dijkstra algorithm; and S24, reducing the existing data quality detection rules according to the shortest paths obtained for each vertex. The invention can effectively mine the data quality detection rules that exist between attributes, reduce the workload of domain experts in designing and configuring data quality detection rules, improve working efficiency, and remove redundant rules, so that the semantics expressed by the finally generated data quality detection rules are more accurate.

Description

Rule reduction method based on data quality detection rule mining result

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a rule reduction method based on a data quality detection rule mining result.

Background

Quality is a measure of conformance. For a tangible product, quality refers to the degree to which a set of inherent characteristics of the product meets requirements. Data is usually regarded as intangible, and its quality refers to the degree to which it conforms to authenticity, legitimacy and usability. Data quality detection rules are the key to detecting data quality; they constrain data, knowledge and business scope through definitions such as semantics and syntax. Automatically discovering data quality detection rules shortens the time domain experts spend designing and configuring such rules, reduces their workload, improves working efficiency, and accelerates data quality construction.

As organizations attach growing importance to data quality construction, the mining of data quality detection rules has increasing development potential. However, the data quality detection rules discovered by the relevant algorithms contain a large amount of redundant information, which in some cases may increase the workload of domain experts and easily introduces ambiguity when the rules are used in subsequent data governance work. How to reduce the rules on the basis of the mining result, so as to generate data quality detection rules that finally satisfy the requirements, has therefore become a new direction of development. Although some methods for mining data quality detection rules exist, they do not reduce the redundant rules in the rule set on the basis of the mining result in combination with actual business requirements and application scenarios. In the long run, such methods cannot adapt well to actual business scenarios and do not help apply theoretical research results to actual business.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a rule reduction method based on the mining result of the data quality detection rule, which can effectively mine the data quality detection rule existing between attributes, reduce the workload of designing and configuring the data quality detection rule by field experts, improve the working efficiency, reduce redundant rules and enable the semantics expressed by the finally generated data quality detection rule to be more accurate.

The purpose of the invention is realized by the following technical scheme: the rule reduction method based on the data quality detection rule mining result comprises the following steps:

S1, mining the data quality detection rules;

S2, performing rule reduction on the mining result, comprising the following sub-steps:

S21, extracting all attributes of the rules obtained in step S1, and taking each attribute as a vertex of a graph G;

S22, drawing the edges of graph G according to the dependency relationships among the attributes, and assigning a weight of 1 to each edge;

S23, obtaining the shortest paths of all vertices in graph G using the Dijkstra algorithm;

S24, reducing the existing data quality detection rules according to the shortest paths obtained for each vertex, and removing redundant rules to obtain the final data quality detection rule set.

Further, step S1 is implemented as follows: let the relational schema be R and let r be an instance of R; attr(R) denotes the set of all attributes of R, X is an attribute set on R, and A is a single attribute on R. Let the mined data quality detection rules take the form of functional dependencies (FDs) X → A; the discovered FDs must satisfy the following two conditions:

Minimality: if X → A holds, then Y → A does not hold for any proper subset Y of X;

Non-triviality: if X → A holds, then attribute A does not belong to attribute set X;

step S1 includes the following substeps:

S11, scanning the database, and first modeling all antecedents on the relation R, namely the X sets in X → A, to obtain an attribute set containment lattice; the search starts from level L₁, which consists of single attributes, obtains level L₂ from level L₁, and continues level by level until the whole lattice has been searched; each attribute set in level L_l contains l attributes;

S12, let i = 1; for each attribute set X in level L_i, compute the set C⁺(X), which the TANE algorithm defines as C⁺(X) = {A ∈ R | for all B ∈ X, X \ {A, B} → B does not hold};

S13, pruning: if the candidate set of X is empty, then the candidate set of every superset Y of X is also empty. Since C(X) is empty, it follows from the definition of C(X) that the set of attributes A ∈ X for which X \ {A} → A holds is not empty, so there is an attribute A in that set such that X \ {A} → A holds. The attribute set Y is a superset of X, obtained by adding further attributes to X, so the dependency X \ {A} → A still holds, and therefore the dependency of the form Y \ {A} → A also holds. The corresponding set for Y is thus not empty either, but this dependency is not minimal, because Y has a subset X for which X \ {A} → A holds. Since the TANE algorithm looks for minimal functional dependencies, the set Y does not need to be processed and is pruned;

S14, generating level L_{i+1}, setting i = i + 1, and returning to step S12.

The invention has the following beneficial effects: it uses functional dependencies as the representation of data quality detection rules, automatically discovers the data quality detection rules in a data table, and on that basis removes redundant rules, so that the discovered rules are of more practical value for subsequent data governance work and, while improving data quality, provide a basis for data users to make decisions. The invention can effectively mine the data quality detection rules that exist between attributes, reduce the workload of domain experts in designing and configuring data quality detection rules, improve working efficiency, and remove redundant rules, so that the semantics expressed by the finally generated rules are more accurate and provide strong data support for the decisions of data users.

Drawings

FIG. 1 is a flow diagram of a rule reduction method based on data quality detection rule mining results;

FIG. 2 is a schematic diagram of the attribute set containment lattice according to the present invention;

FIG. 3 is a schematic diagram of an edge weighted graph G according to the present invention.

Detailed Description

Some terms to which the present invention relates are explained as follows:

1. Functional dependency

A functional dependency (FD) characterizes a relationship between attributes in a database relation: the value of one attribute is uniquely determined by the values of several other attributes. For example, in an address database, the zip code is determined by the city and the street address. Formally, a functional dependency on a relational schema R is written X → A, where X ⊆ R and A ∈ R; X is called the left-hand side and A the right-hand side. The dependency holds (is valid) on an instance r of R if, for all tuples t, u ∈ r, t[B] = u[B] for every B ∈ X implies t[A] = u[A]; in other words, whenever t and u agree on the attribute set X, they also agree on the attribute A. The dependency X → A is minimal if A does not depend on any proper subset of X, that is, if Y → A does not hold on r for any proper subset Y of X. In addition, if attribute A does not belong to attribute set X, the dependency X → A is non-trivial. Discovering data quality detection rules from a given data table thus means finding, for a given relation r, all minimal, non-trivial functional dependencies that hold on r.
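The definitions above can be checked directly on a small table. The following is a minimal brute-force sketch, not the method of the invention: the function names (`fd_holds`, `is_nontrivial`, `is_minimal`) and the sample rows are hypothetical and exist only for illustration.

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds on rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        # Two tuples that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

def is_nontrivial(lhs, rhs):
    """An FD is non-trivial when its right-hand side is not in its left-hand side."""
    return rhs not in lhs

def is_minimal(rows, lhs, rhs):
    """lhs -> rhs holds and no proper subset of lhs determines rhs."""
    if not fd_holds(rows, lhs, rhs):
        return False
    lhs = sorted(lhs)
    return not any(
        fd_holds(rows, subset, rhs)
        for size in range(len(lhs))          # sizes 0 .. len(lhs) - 1
        for subset in combinations(lhs, size)
    )

# Invented sample data: placeholder values, not real addresses.
rows = [
    {"city": "C1", "street": "S1", "zip": "Z1"},
    {"city": "C1", "street": "S1", "zip": "Z1"},
    {"city": "C1", "street": "S2", "zip": "Z2"},
    {"city": "C2", "street": "S1", "zip": "Z3"},
]
```

On this sample, `{city, street} → zip` holds and is minimal, whereas `{street, zip} → city` holds but is not minimal because `zip → city` already holds. The subset enumeration is exponential in the size of the left-hand side, so this form is only suitable for explaining the definitions.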

2. Data quality detection rule mining algorithm

Among algorithms for mining data quality detection rules, the TANE algorithm is widely used. TANE is a level-wise algorithm that searches the attribute set containment lattice, generating the attribute sets of size k + 1 from those of size k. It first examines the attribute sets containing a single attribute and then moves through the lattice level by level to sets containing more attributes. When the algorithm processes an attribute set X, it tests whether dependencies of the form X \ {A} → A hold, where A belongs to X, which guarantees that only non-trivial functional dependencies are considered. Searching from small sets to large sets also guarantees that only minimal dependencies are output, and it enables efficient pruning.
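The step from the sets of size k to the sets of size k + 1 can be sketched as follows. This is a simplified illustration under stated assumptions: the name `next_level` is hypothetical, attribute sets are represented as `frozenset` objects, and pairs are joined naively rather than with the prefix blocks an optimized implementation would use.

```python
from itertools import combinations

def next_level(level):
    """Build the size-(k+1) attribute sets from the size-k sets in `level`."""
    current = set(level)
    candidates = set()
    for x, y in combinations(current, 2):
        union = x | y
        # Join two sets that differ in exactly one attribute.
        if len(union) == len(x) + 1:
            # Keep the union only if every size-k subset is still in the level,
            # i.e. none of its subsets was pruned earlier.
            if all(union - {a} in current for a in union):
                candidates.add(union)
    return candidates

attributes = ["A", "B", "C", "D"]
level_1 = {frozenset([a]) for a in attributes}
level_2 = next_level(level_1)
level_3 = next_level(level_2)
```

If a set such as `{A, B}` is removed from `level_2` by pruning, no superset of it appears in the next level, which is what makes the level-wise search efficient.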

TANE finds minimal, non-trivial functional dependencies through the attribute set containment lattice. To test whether a candidate dependency X \ {A} → A is minimal, it must be known whether Y \ {A} → A holds for the proper subsets Y of X; this information is stored in the right-hand-side candidate set C(Y) of Y.

If, for a given attribute set X, an attribute A belongs to the set C(X), then A does not depend on any proper subset of X. More precisely, the set of initial right-hand-side (rhs) candidates of X is:

C(X) = R \ C̄(X),

where

C̄(X) = {A ∈ X | X \ {A} → A holds}.

To find minimal dependencies, it is necessary to test whether X \ {A} → A holds, where A ∈ X and, for every attribute B ∈ X, A ∈ C(X \ {B}). Note that C(X \ {B}) = R \ C̄(X \ {B}), while C̄(X \ {B}) = {A ∈ X \ {B} | X \ {B} \ {A} → A holds}.

If attribute A belongs to this latter set, the dependency X \ {A} → A is not minimal, because an attribute B can be removed from its left-hand side X \ {A} and the dependency X \ {B} \ {A} → A still holds. A minimal functional dependency therefore requires A ∈ C(X \ {B}): if X \ {A} → A holds, no attribute B can be removed from its left-hand side, otherwise the dependency would no longer hold.
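The candidate-set test can be illustrated with a short sketch. It assumes the TANE definition of the initial rhs candidates, C(X) = R \ {A ∈ X | X \ {A} → A holds}, and evaluates it by brute force; the function names and the sample rows are hypothetical, and a real implementation would reuse the candidate sets of the previous level instead of recomputing them.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs holds on rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in sorted(lhs))
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

def rhs_candidates(rows, schema, x):
    # C(X) = R minus {A in X | X minus {A} -> A holds}
    determined = {a for a in x if fd_holds(rows, x - {a}, a)}
    return set(schema) - determined

def is_minimal_dependency(rows, schema, x, a):
    # X minus {A} -> A must hold, and A must lie in C(X minus {B}) for every B in X.
    if a not in x or not fd_holds(rows, x - {a}, a):
        return False
    return all(a in rhs_candidates(rows, schema, x - {b}) for b in x)

# Invented sample data: placeholder values only.
schema = {"city", "street", "zip"}
rows = [
    {"city": "C1", "street": "S1", "zip": "Z1"},
    {"city": "C1", "street": "S1", "zip": "Z1"},
    {"city": "C1", "street": "S2", "zip": "Z2"},
    {"city": "C2", "street": "S1", "zip": "Z3"},
]
```

With X = {city, street, zip}, the dependency X \ {zip} → zip passes the test, while X \ {city} → city fails it: city is missing from C({city, zip}) because zip → city already holds.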

3. Dijkstra algorithm

Dijkstra's algorithm was proposed in 1959 by the Dutch computer scientist Edsger W. Dijkstra. It is a single-source shortest path algorithm: it computes the shortest paths from one vertex to all other vertices of a weighted graph with non-negative edge weights. Its main characteristic is that it expands outward layer by layer from the starting vertex until the destination is reached.
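For reference, a compact priority-queue version of Dijkstra's algorithm is sketched below. The adjacency-list layout (a dict mapping each vertex to a list of `(neighbor, weight)` pairs), the function names and the example graph are assumptions made for illustration.

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances and predecessors from source; weights must be non-negative."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev

def path_to(prev, source, target):
    """Rebuild the path source -> target; raises KeyError if target is unreachable."""
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 3)],
    "d": [],
}
dist, prev = dijkstra(graph, "a")
```

Running it from every vertex in turn gives the shortest paths between all pairs of vertices, which is how a single-source algorithm can serve a step that needs paths for all vertices.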

The technical scheme of the invention is further explained by combining the attached drawings.

As shown in fig. 1, a rule reduction method based on a data quality detection rule mining result of the present invention includes the following steps:

S1, mining the data quality detection rules. The specific implementation is as follows: let the relational schema be R and let r be an instance of R; attr(R) denotes the set of all attributes of R, X is an attribute set on R, and A is a single attribute on R. Let the mined data quality detection rules take the form of functional dependencies (FDs) X → A; the discovered FDs must satisfy the two conditions of minimality and non-triviality stated above;

step S1 includes the following substeps:

S11, scanning the database, and first modeling all antecedents on the relation R, namely the X sets in X → A, to obtain an attribute set containment lattice, as shown in FIG. 2; the search starts from level L₁, which consists of single attributes, obtains level L₂ from level L₁, and continues level by level until the whole lattice has been searched; each attribute set in level L_l contains l attributes;

S13, pruning: if the candidate set of X is empty, then the candidate set of every superset Y of X is also empty. Since C(X) is empty, it follows from the definition of C(X) that the set of attributes A ∈ X for which X \ {A} → A holds is not empty, so there is an attribute A in that set such that X \ {A} → A holds. The attribute set Y is a superset of X, obtained by adding further attributes to X, so the dependency X \ {A} → A still holds, and therefore the dependency of the form Y \ {A} → A also holds. The corresponding set for Y is thus not empty either, but this dependency is not minimal, because Y has a subset X for which X \ {A} → A holds. Since the TANE algorithm looks for minimal functional dependencies, the set Y does not need to be processed and is pruned;

The TANE algorithm is based on partitioning the set of rows according to their attribute values, so the validity of a functional dependency can be tested quickly even when the relation contains a large number of tuples, and the functional dependencies in a data table that satisfy the conditions can be found quickly.
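The partition idea can be illustrated with a short sketch. It relies on the known property that X → A holds exactly when partitioning the rows by X yields as many equivalence classes as partitioning them by X ∪ {A}. The function names and sample rows are hypothetical, and the stripped partitions and partition products used by an optimized implementation are left out.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    classes = defaultdict(list)
    for index, row in enumerate(rows):
        classes[tuple(row[a] for a in sorted(attrs))].append(index)
    return list(classes.values())

def fd_holds_by_partition(rows, lhs, rhs):
    # lhs -> rhs holds exactly when adding rhs does not split any class of lhs.
    return len(partition(rows, lhs)) == len(partition(rows, set(lhs) | {rhs}))

# Invented sample data: placeholder values only.
rows = [
    {"city": "C1", "street": "S1", "zip": "Z1"},
    {"city": "C1", "street": "S1", "zip": "Z1"},
    {"city": "C1", "street": "S2", "zip": "Z2"},
    {"city": "C2", "street": "S1", "zip": "Z3"},
]
```

Each test needs only two passes over the rows and compares class counts instead of comparing tuples pairwise.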

S22, drawing the edges of graph G according to the dependency relationships among the attributes, and assigning a weight of 1 to each edge, as shown in FIG. 3;
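One possible way to build such a graph from a set of mined rules is sketched below. The text does not specify how a rule with several left-hand-side attributes is drawn, so this sketch assumes one directed edge from each left-hand-side attribute to the right-hand-side attribute; the function name `build_graph` and the example rules are hypothetical.

```python
def build_graph(rules):
    """rules: iterable of (lhs_attributes, rhs_attribute) pairs.

    Returns a dict mapping each vertex to {neighbor: weight}.
    """
    graph = {}
    for lhs, rhs in rules:
        graph.setdefault(rhs, {})              # every attribute becomes a vertex
        for attr in lhs:
            # Assumption: one edge per left-hand-side attribute, weight 1.
            graph.setdefault(attr, {})[rhs] = 1
    return graph

# Invented example rules: A -> B, B -> C, {A, D} -> C.
rules = [(("A",), "B"), (("B",), "C"), (("A", "D"), "C")]
graph = build_graph(rules)
```

Because every edge has weight 1, the length of a shortest path between two attributes equals the number of dependencies chained along it.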

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims

1. The rule reduction method based on the data quality detection rule mining result is characterized by comprising the following steps of:

S1, mining the data quality detection rules;

2. The rule reduction method based on the data quality detection rule mining result according to claim 1, wherein step S1 is implemented as follows: let the relational schema be R and let r be an instance of R; attr(R) denotes the set of all attributes of R, X is an attribute set on R, and A is a single attribute on R; let the mined data quality detection rules take the form of functional dependencies (FDs) X → A; the discovered FDs must satisfy two conditions, minimality and non-triviality;

step S1 includes the following substeps:

S13, pruning: if the candidate set of X is empty, then the candidate set of every superset Y of X is also empty. Since C(X) is empty, it follows from the definition of C(X) that the set of attributes A ∈ X for which X \ {A} → A holds is not empty, which shows that in this set