CN110825788A - Rule reduction method based on data quality detection rule mining result - Google Patents
Rule reduction method based on data quality detection rule mining result Download PDFInfo
- Publication number
- CN110825788A CN110825788A CN201911079988.4A CN201911079988A CN110825788A CN 110825788 A CN110825788 A CN 110825788A CN 201911079988 A CN201911079988 A CN 201911079988A CN 110825788 A CN110825788 A CN 110825788A
- Authority
- CN
- China
- Prior art keywords
- attribute
- data quality
- quality detection
- rule
- attributes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 48
- 238000005065 mining Methods 0.000 title claims abstract description 24
- 238000000034 method Methods 0.000 title claims abstract description 20
- 238000013138 pruning Methods 0.000 claims description 4
- 230000009286 beneficial effect Effects 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- 238000010276 construction Methods 0.000 description 2
- 238000013523 data management Methods 0.000 description 1
- 238000007418 data mining Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 230000008520 organization Effects 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a rule reduction method based on a data quality detection rule mining result, which comprises the following steps: s1, mining based on the data quality detection rule; s2, carrying out rule reduction on the mining result, and comprising the following sub-steps: s21, extracting all the attributes of the rules obtained in the step S1, and taking each attribute as a vertex of the graph G; s22, drawing the edges of the graph G according to the dependency relationship among the attributes, and assigning a weight 1 to each edge; s23, obtaining the shortest path of all the vertexes in the graph G by using a Dijkstra algorithm; and S24, reducing the existing data quality detection rule according to the shortest path obtained by each vertex. The invention can effectively excavate the data quality detection rules existing between the attributes, reduce the workload of designing and configuring the data quality detection rules by field experts, improve the working efficiency and reduce redundant rules, so that the semantics expressed by the finally generated data quality detection rules are more accurate.
Description
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a rule reduction method based on a data quality detection rule mining result.
Background
Quality is a measure of compliance, and for a tangible product, quality refers to the extent to which a set of inherent characteristics of the product meet requirements. While data is often considered intangible, its quality refers to the degree of compliance with authenticity, legitimacy and usability. The data quality detection rule is a key for detecting data quality, and is a mode for limiting data, knowledge and service range by using a definition method such as semantics, grammar and the like. The period of designing and configuring the data quality detection rules by field experts can be reduced by automatically discovering the data quality detection rules, the workload of the field experts is reduced, the working efficiency is improved, and the construction process of the data quality is accelerated.
With the importance of organization on data quality construction, mining of data quality detection rules has development potential more and more, but the data quality detection rules discovered by using related algorithms contain more redundant information, the workload of field experts may be increased under certain conditions, and ambiguity is easily generated when the data quality detection rules are used for subsequent data governance work. Therefore, how to reduce the rules based on the mining result of the data quality detection rules to generate the data quality detection rules which finally satisfy the conditions becomes a new development direction. Although some methods for mining data quality detection rules exist at present, redundant rules in rule sets are not reduced by combining actual service requirements and application scenarios on the basis of mining results, and from the long-term development point of view, the method cannot well adapt to the requirements of actual service scenarios, and is not beneficial to applying theoretical research results to actual services well.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a rule reduction method based on the mining result of the data quality detection rule, which can effectively mine the data quality detection rule existing between attributes, reduce the workload of designing and configuring the data quality detection rule by field experts, improve the working efficiency, reduce redundant rules and enable the semantics expressed by the finally generated data quality detection rule to be more accurate.
The purpose of the invention is realized by the following technical scheme: the rule reduction method based on the data quality detection rule mining result comprises the following steps:
s1, mining based on the data quality detection rule;
s2, carrying out rule reduction on the mining result, and comprising the following sub-steps:
s21, extracting all the attributes of the rules obtained in the step S1, and taking each attribute as a vertex of the graph G;
s22, drawing the edges of the graph G according to the dependency relationship among the attributes, and assigning a weight 1 to each edge;
s23, obtaining the shortest path of all the vertexes in the graph G by using a Dijkstra algorithm;
and S24, reducing the existing data quality detection rules according to the shortest path obtained by each vertex, and removing redundant rules to obtain a final data quality detection rule set.
Further, the specific implementation method of step S1 is as follows: let relation mode be R, some example of R be R, attr (R) represents the set of all attributes of relation R, X is a certain attribute set on relation mode R, and A is a certain single attribute on relation mode R; let the expression of the mined data quality detection rule be FD: X → A, and find that FDs must satisfy the following two conditions:
minimum performance: means that if X → A holds, then Y → A does not hold for any one subset Y of X;
non-trivial: means that if X → A holds true, then attribute A does not belong to attribute set X;
step S1 includes the following substeps:
s11, scanning a database, and firstly, modeling all the prior values, namely the X set in X → A, on the relation R to obtain an attribute containing lattice; in the search, the layer L composed of single attributes is first1Starting from L1Layer is obtained as2Layering until all attribute grids are searched; and L islThe number of attribute sets contained in the layer is l;
s12, if i is 1, it is the L-thiEach attribute set X of a layer computes a set C+(X):
S13, pruning: if it is notThen for all supersets Y of X, there areSince C (X) is empty, is composed ofIn a clear view of the above, it is known that,if not, then prove in the aggregateThe attribute A exists to ensure that X \ A } → A is established, the attribute set Y is a superset of the attribute set X, other attributes are added on the basis of X, and the dependence of X \ A } → A is also established; therefore, the dependence on the shape of Y \ A } → A also holds,will not be empty, but this dependency is not minimal, since there is a subset X of Y such that X \ a } → a holds, and since the smallest functional dependency is sought by the TANE algorithm, there is no need to process set Y, so set Y is pruned;
s14, continuing to generate the Li+1The layer returns to step S12 by setting i to i + 1.
The invention has the beneficial effects that: the invention automatically discovers the data quality detection rules in the data table by using the function dependence as the expression form of the data quality detection rules and reducing the redundancy rules on the basis, so that the discovered data quality detection rules have more practical value for the subsequent data management work, and provide basis for making corresponding decisions for data users while improving the data quality. The invention can effectively excavate the data quality detection rules existing between the attributes, reduce the workload of designing and configuring the data quality detection rules by field experts, improve the working efficiency, reduce the redundant rules, ensure that the semantics expressed by the finally generated data quality detection rules are more accurate, and provide powerful data support for making corresponding decisions for data users.
Drawings
FIG. 1 is a flow diagram of a rule reduction method based on data quality detection rule mining results;
FIG. 2 is a schematic diagram of an attribute containing grid according to the present invention;
FIG. 3 is a schematic diagram of an edge weighted graph G according to the present invention.
Detailed Description
Some terms to which the present invention relates are explained as follows:
1. function dependence
Function Dependency (FD) characterizes the relationship between attributes in a database relationship: that is, a function dependency characterizes the value of a certain attribute as being uniquely determined by several other attributes. For example, in an address database, the zip code is determined by the city and street address. Formally, the functional dependence on the relational schema R can be expressed as: x → A, whereinA ∈ R, X is called the left part, A is called the right part. The condition that a certain functional dependency holds or is valid on a certain instance R of the relation R is: for all tuples t, u belonging to r, if t [ B ]]=u[B]Then t [ A ]]=u[A]It can also be said that t and u agree on the attribute set X and the attribute a. If the value of A does not depend on any subset of X, then the functional dependence of X → A is minimal. For example, if Y → A does not hold on r for any subset Y of X, then the functional dependency of X → A is minimal. In addition, if attribute A does not belong to attribute set X, then the functional dependence of X → A is nontrivial. The data quality detection rules are found from the given data table, i.e. for a given relation r, all the most significant ones of r are found to be trueSmall, non-trivial function dependence.
2. Data quality detection rule mining algorithm
Among algorithms for mining the data quality detection rule, the TANE algorithm is more commonly used. The TANE algorithm is a hierarchy-based algorithm that searches an attribute set containing lattice and obtains k +1 attributes from the k attributes. It first searches the attribute set containing only single attribute, and then searches the attribute set containing multiple attributes layer by layer through the structure of attribute search grid. When the algorithm is processing a certain attribute set X, it will detect whether a dependency like X \ A } → A (attribute A belongs to attribute set X) holds, which ensures that only non-trivial function dependencies will be considered. The small-to-large direction can also ensure that only the minimum dependence can be output, thereby being beneficial to carrying out efficient pruning operation.
TANE finds the smallest and non-trivial function dependence through the attribute set inclusion lattice. In order to detect the minimum dependency of X \ a } → a that may satisfy the requirement, it is necessary to determine whether Y \ a } → a holds for some property subsets Y of the property set X, and this information is stored in the right-hand candidate set c (Y) of Y.
If for a given property set X, there is a property A belonging to set C (X), then A will not depend on any appropriate subset of X. More precisely, the set of preliminary rhs candidates for set X is:wherein,to find the minimum dependency, it is necessary to detect whether X \ A } → A is satisfied, and in X \ A } → A, the attribute A belongs to the attribute set X, and for all the attributes B belonging to the attribute set X, there is A ∈ C (X \ B }). It is to be noted that it is preferable that,while The functional dependency X \ B } \ { A } → A formed by the attribute A in the set is not minimal, because if X \ A } → A holds true, one attribute B can be removed from the remaining set of the left part X \ A } so that the dependency X \ B } \ { A } → A holds true. And a minimum function dependency requirement A ∈ C (X \ B }), that is, X \ A } → A, if true, cannot remove any attribute B in the left part of the dependency, otherwise X \ A } → A will not be true.
3. Dijkstra algorithm
Dijkstra's algorithm was proposed in 1959 by the netherlands computer scientist dikstra, and is therefore also called the dikstra algorithm. The method is a shortest path algorithm from one vertex to the rest of the vertices, and solves the shortest path problem in the weighted graph. The Dijkstra algorithm is mainly characterized in that the Dijkstra algorithm expands outwards layer by taking a starting point as a center until the expansion reaches a terminal point.
The technical scheme of the invention is further explained by combining the attached drawings.
As shown in fig. 1, a rule reduction method based on a data quality detection rule mining result of the present invention includes the following steps:
s1, mining based on the data quality detection rule; the specific implementation method comprises the following steps: let relation mode be R, some example of R be R, attr (R) represents the set of all attributes of relation R, X is a certain attribute set on relation mode R, and A is a certain single attribute on relation mode R; let the expression of the mined data quality detection rule be FD: X → A, and find that FDs must satisfy the following two conditions:
minimum performance: means that if X → A holds, then Y → A does not hold for any one subset Y of X;
non-trivial: means that if X → A holds true, then attribute A does not belong to attribute set X;
step S1 includes the following substeps:
s11, scanning a database, and firstly, modeling all the antecedent values, namely the X set in X → A, on the relation R to obtain an attribute containing lattice, as shown in FIG. 2; in the search, the layer L composed of single attributes is first1Starting from L1Layer is obtained as2Layering until all attribute grids are searched; and L islThe number of attribute sets contained in the layer is l;
s12, if i is 1, it is the L-thiEach attribute set X of a layer computes a set C+(X):
S13, pruning: if it is notThen for all supersets Y of X, there areSince C (X) is empty, is composed ofIn a clear view of the above, it is known that,if not, then prove in the aggregateThe attribute A exists to ensure that X \ A } → A is established, the attribute set Y is a superset of the attribute set X, other attributes are added on the basis of X, and the dependence of X \ A } → A is also established; therefore, the dependence on the shape of Y \ A } → A also holds,will not be empty, but this dependency is not minimal because there is a subset X of Y such that X \ A } → A holds, and because the TANE algorithm is to find that it is minimalSo there is no need to process set Y, so set Y is pruned;
s14, continuing to generate the Li+1The layer returns to step S12 by setting i to i + 1.
The TANE algorithm is based on the idea of partitioning attribute values of a row set, so that the effectiveness of function dependence can be quickly tested even if a large number of tuples are contained in the relation, and further the function dependence meeting the conditions existing in a data table can be quickly found out.
S2, carrying out rule reduction on the mining result, and comprising the following sub-steps:
s21, extracting all the attributes of the rules obtained in the step S1, and taking each attribute as a vertex of the graph G;
s22, drawing the edges of the graph G according to the dependency relationship among the attributes, and giving a weight 1 to each edge, as shown in FIG. 3;
s23, obtaining the shortest path of all the vertexes in the graph G by using a Dijkstra algorithm;
and S24, reducing the existing data quality detection rules according to the shortest path obtained by each vertex, and removing redundant rules to obtain a final data quality detection rule set.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
Claims (2)
1. The rule reduction method based on the data quality detection rule mining result is characterized by comprising the following steps of:
s1, mining based on the data quality detection rule;
s2, carrying out rule reduction on the mining result, and comprising the following sub-steps:
s21, extracting all the attributes of the rules obtained in the step S1, and taking each attribute as a vertex of the graph G;
s22, drawing the edges of the graph G according to the dependency relationship among the attributes, and assigning a weight 1 to each edge;
s23, obtaining the shortest path of all the vertexes in the graph G by using a Dijkstra algorithm;
and S24, reducing the existing data quality detection rules according to the shortest path obtained by each vertex, and removing redundant rules to obtain a final data quality detection rule set.
2. The method for reducing rules based on mining results of data quality detection rules according to claim 1, wherein the step S1 is implemented by: let relation mode be R, some example of R be R, attr (R) represents the set of all attributes of relation R, X is a certain attribute set on relation mode R, and A is a certain single attribute on relation mode R; let the expression of the mined data quality detection rule be FD: X → A, and find that FDs must satisfy the following two conditions:
minimum performance: means that if X → A holds, then Y → A does not hold for any one subset Y of X;
non-trivial: means that if X → A holds true, then attribute A does not belong to attribute set X;
step S1 includes the following substeps:
s11, scanning a database, and firstly, modeling all the prior values, namely the X set in X → A, on the relation R to obtain an attribute containing lattice; in the search, the layer L composed of single attributes is first1Starting from L1Layer is obtained as2Layering until all attribute grids are searched; and L islThe number of attribute sets contained in the layer is l;
s12, if i is 1, it is the L-thiEach attribute set X of a layer computes a set C+(X):
S13, pruning: if it is notThen for all supersets Y of X, there areSince C (X) is empty, is composed ofIn a clear view of the above, it is known that,if not, then prove in the aggregateThe attribute A exists to ensure that X \ A } → A is established, the attribute set Y is a superset of the attribute set X, other attributes are added on the basis of X, and the dependence of X \ A } → A is also established; therefore, the dependence on the shape of Y \ A } → A also holds,will not be empty, but this dependency is not minimal, since there is a subset X of Y such that X \ a } → a holds, and since the smallest functional dependency is sought by the TANE algorithm, there is no need to process set Y, so set Y is pruned;
s14, continuing to generate the Li+1The layer returns to step S12 by setting i to i + 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911079988.4A CN110825788A (en) | 2019-11-07 | 2019-11-07 | Rule reduction method based on data quality detection rule mining result |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911079988.4A CN110825788A (en) | 2019-11-07 | 2019-11-07 | Rule reduction method based on data quality detection rule mining result |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110825788A true CN110825788A (en) | 2020-02-21 |
Family
ID=69553235
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911079988.4A Pending CN110825788A (en) | 2019-11-07 | 2019-11-07 | Rule reduction method based on data quality detection rule mining result |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110825788A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115994194A (en) * | 2023-03-23 | 2023-04-21 | 河北东软软件有限公司 | Method, system, equipment and medium for checking data quality of government affair big data |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20010051947A1 (en) * | 2000-05-02 | 2001-12-13 | International Business Machines Corporation | Spatial data mining method, spatial data mining apparatus and storage medium |
CN102163300A (en) * | 2011-04-20 | 2011-08-24 | 南京航空航天大学 | Method for optimizing fault diagnosis rules based on ant colony optimization algorithm |
CN106294715A (en) * | 2016-08-09 | 2017-01-04 | 中国地质大学(武汉) | A kind of association rule mining method based on attribute reduction and device |
CN109614491A (en) * | 2018-12-21 | 2019-04-12 | 成都康赛信息技术有限公司 | Further method for digging based on data quality checking rule digging result |
CN110097934A (en) * | 2018-12-29 | 2019-08-06 | 蚌埠医学院 | A kind of attributive character reduction method of electrocardio Ontological concept |
-
2019
- 2019-11-07 CN CN201911079988.4A patent/CN110825788A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20010051947A1 (en) * | 2000-05-02 | 2001-12-13 | International Business Machines Corporation | Spatial data mining method, spatial data mining apparatus and storage medium |
CN102163300A (en) * | 2011-04-20 | 2011-08-24 | 南京航空航天大学 | Method for optimizing fault diagnosis rules based on ant colony optimization algorithm |
CN106294715A (en) * | 2016-08-09 | 2017-01-04 | 中国地质大学(武汉) | A kind of association rule mining method based on attribute reduction and device |
CN109614491A (en) * | 2018-12-21 | 2019-04-12 | 成都康赛信息技术有限公司 | Further method for digging based on data quality checking rule digging result |
CN110097934A (en) * | 2018-12-29 | 2019-08-06 | 蚌埠医学院 | A kind of attributive character reduction method of electrocardio Ontological concept |
Non-Patent Citations (3)
Title |
---|
LEILEI CHANG等: "Structure learning for belief rule base expert system: A comparative study", 《ELSEVIER》 * |
杨隆浩等: "基于关联系数标准差融合的置信规则库规则约简方法", 《信息与控制》 * |
王雪珊: "数据质量校验规则提取技术的研究", 《中国优秀硕士学位论文全文数据库》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115994194A (en) * | 2023-03-23 | 2023-04-21 | 河北东软软件有限公司 | Method, system, equipment and medium for checking data quality of government affair big data |
CN115994194B (en) * | 2023-03-23 | 2023-06-02 | 河北东软软件有限公司 | Method, system, equipment and medium for checking data quality of government affair big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR101105363B1 (en) | Finding Frequent Item Sets for Long Transaction Data Streams | |
CN107229751A (en) | A kind of concurrent incremental formula association rule mining method towards stream data | |
CN110069500B (en) | Dynamic mixed indexing method for non-relational database | |
Min et al. | Symmetric continuous subgraph matching with bidirectional dynamic programming | |
CN105117488B (en) | A kind of distributed storage RDF data balanced division method based on hybrid hierarchy cluster | |
CN109635037B (en) | Shard storage method and device for a relational distributed database | |
CN104408584A (en) | Analysis method and system for transaction relevance | |
CN112925789B (en) | Spark-based space vector data memory storage query method and system | |
CN112699134A (en) | Distributed graph database storage and query method based on graph subdivision | |
CN116521956A (en) | A graph database query method, device, electronic equipment and storage medium | |
Dhulipala et al. | Terahac: Hierarchical agglomerative clustering of trillion-edge graphs | |
Cao et al. | An improved method to build the KD tree based on presorted results | |
CN102043857B (en) | All-nearest-neighbor query method and system | |
CN109697206A (en) | A kind of distributive function dependence method for digging | |
CN110825788A (en) | Rule reduction method based on data quality detection rule mining result | |
Naresh et al. | Implementation of dynamic and fast mining algorithms on incremental datasets to discover qualitative rules | |
CN115114464A (en) | Power grid graph database storage method based on multi-Hash algorithm | |
Lasek et al. | Density-based clustering with constraints | |
Lin et al. | Mining of high average-utility patterns with item-level thresholds | |
CN111107493B (en) | Method and system for predicting position of mobile user | |
CN105740371A (en) | Density-based incremental clustering data mining method and system | |
CN117370613A (en) | Method and device for accelerating shortest path for large-scale graph data | |
Cao et al. | Research on searching algorithms for unstructured grid remapping based on KD tree | |
CN110807061A (en) | Method for searching frequent subgraphs of uncertain graphs based on layering | |
CN114116785A (en) | A Distributed SPARQL Query Optimization Method Based on Minimum Attribute Cut |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200221 |
|
RJ01 | Rejection of invention patent application after publication |