[go: up one dir, main page]

CN110825788A - Rule reduction method based on data quality detection rule mining result - Google Patents

Rule reduction method based on data quality detection rule mining result Download PDF

Info

Publication number
CN110825788A
CN110825788A CN201911079988.4A CN201911079988A CN110825788A CN 110825788 A CN110825788 A CN 110825788A CN 201911079988 A CN201911079988 A CN 201911079988A CN 110825788 A CN110825788 A CN 110825788A
Authority
CN
China
Prior art keywords
attribute
data quality
quality detection
rule
attributes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911079988.4A
Other languages
Chinese (zh)
Inventor
唐雪飞
黄永鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd
Original Assignee
CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd filed Critical CHENGDU COMSYS INFORMATION TECHNOLOGY Co Ltd
Priority to CN201911079988.4A priority Critical patent/CN110825788A/en
Publication of CN110825788A publication Critical patent/CN110825788A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a rule reduction method based on a data quality detection rule mining result, which comprises the following steps: s1, mining based on the data quality detection rule; s2, carrying out rule reduction on the mining result, and comprising the following sub-steps: s21, extracting all the attributes of the rules obtained in the step S1, and taking each attribute as a vertex of the graph G; s22, drawing the edges of the graph G according to the dependency relationship among the attributes, and assigning a weight 1 to each edge; s23, obtaining the shortest path of all the vertexes in the graph G by using a Dijkstra algorithm; and S24, reducing the existing data quality detection rule according to the shortest path obtained by each vertex. The invention can effectively excavate the data quality detection rules existing between the attributes, reduce the workload of designing and configuring the data quality detection rules by field experts, improve the working efficiency and reduce redundant rules, so that the semantics expressed by the finally generated data quality detection rules are more accurate.

Description

Rule reduction method based on data quality detection rule mining result
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a rule reduction method based on a data quality detection rule mining result.
Background
Quality is a measure of compliance, and for a tangible product, quality refers to the extent to which a set of inherent characteristics of the product meet requirements. While data is often considered intangible, its quality refers to the degree of compliance with authenticity, legitimacy and usability. The data quality detection rule is a key for detecting data quality, and is a mode for limiting data, knowledge and service range by using a definition method such as semantics, grammar and the like. The period of designing and configuring the data quality detection rules by field experts can be reduced by automatically discovering the data quality detection rules, the workload of the field experts is reduced, the working efficiency is improved, and the construction process of the data quality is accelerated.
With the importance of organization on data quality construction, mining of data quality detection rules has development potential more and more, but the data quality detection rules discovered by using related algorithms contain more redundant information, the workload of field experts may be increased under certain conditions, and ambiguity is easily generated when the data quality detection rules are used for subsequent data governance work. Therefore, how to reduce the rules based on the mining result of the data quality detection rules to generate the data quality detection rules which finally satisfy the conditions becomes a new development direction. Although some methods for mining data quality detection rules exist at present, redundant rules in rule sets are not reduced by combining actual service requirements and application scenarios on the basis of mining results, and from the long-term development point of view, the method cannot well adapt to the requirements of actual service scenarios, and is not beneficial to applying theoretical research results to actual services well.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a rule reduction method based on the mining result of the data quality detection rule, which can effectively mine the data quality detection rule existing between attributes, reduce the workload of designing and configuring the data quality detection rule by field experts, improve the working efficiency, reduce redundant rules and enable the semantics expressed by the finally generated data quality detection rule to be more accurate.
The purpose of the invention is realized by the following technical scheme: the rule reduction method based on the data quality detection rule mining result comprises the following steps:
s1, mining based on the data quality detection rule;
s2, carrying out rule reduction on the mining result, and comprising the following sub-steps:
s21, extracting all the attributes of the rules obtained in the step S1, and taking each attribute as a vertex of the graph G;
s22, drawing the edges of the graph G according to the dependency relationship among the attributes, and assigning a weight 1 to each edge;
s23, obtaining the shortest path of all the vertexes in the graph G by using a Dijkstra algorithm;
and S24, reducing the existing data quality detection rules according to the shortest path obtained by each vertex, and removing redundant rules to obtain a final data quality detection rule set.
Further, the specific implementation method of step S1 is as follows: let relation mode be R, some example of R be R, attr (R) represents the set of all attributes of relation R, X is a certain attribute set on relation mode R, and A is a certain single attribute on relation mode R; let the expression of the mined data quality detection rule be FD: X → A, and find that FDs must satisfy the following two conditions:
minimum performance: means that if X → A holds, then Y → A does not hold for any one subset Y of X;
non-trivial: means that if X → A holds true, then attribute A does not belong to attribute set X;
step S1 includes the following substeps:
s11, scanning a database, and firstly, modeling all the prior values, namely the X set in X → A, on the relation R to obtain an attribute containing lattice; in the search, the layer L composed of single attributes is first1Starting from L1Layer is obtained as2Layering until all attribute grids are searched; and L islThe number of attribute sets contained in the layer is l;
s12, if i is 1, it is the L-thiEach attribute set X of a layer computes a set C+(X):
S13, pruning: if it is not
Figure BDA0002263651830000022
Then for all supersets Y of X, there are
Figure BDA0002263651830000023
Since C (X) is empty, is composed ofIn a clear view of the above, it is known that,if not, then prove in the aggregateThe attribute A exists to ensure that X \ A } → A is established, the attribute set Y is a superset of the attribute set X, other attributes are added on the basis of X, and the dependence of X \ A } → A is also established; therefore, the dependence on the shape of Y \ A } → A also holds,
Figure BDA0002263651830000027
will not be empty, but this dependency is not minimal, since there is a subset X of Y such that X \ a } → a holds, and since the smallest functional dependency is sought by the TANE algorithm, there is no need to process set Y, so set Y is pruned;
s14, continuing to generate the Li+1The layer returns to step S12 by setting i to i + 1.
The invention has the beneficial effects that: the invention automatically discovers the data quality detection rules in the data table by using the function dependence as the expression form of the data quality detection rules and reducing the redundancy rules on the basis, so that the discovered data quality detection rules have more practical value for the subsequent data management work, and provide basis for making corresponding decisions for data users while improving the data quality. The invention can effectively excavate the data quality detection rules existing between the attributes, reduce the workload of designing and configuring the data quality detection rules by field experts, improve the working efficiency, reduce the redundant rules, ensure that the semantics expressed by the finally generated data quality detection rules are more accurate, and provide powerful data support for making corresponding decisions for data users.
Drawings
FIG. 1 is a flow diagram of a rule reduction method based on data quality detection rule mining results;
FIG. 2 is a schematic diagram of an attribute containing grid according to the present invention;
FIG. 3 is a schematic diagram of an edge weighted graph G according to the present invention.
Detailed Description
Some terms to which the present invention relates are explained as follows:
1. function dependence
Function Dependency (FD) characterizes the relationship between attributes in a database relationship: that is, a function dependency characterizes the value of a certain attribute as being uniquely determined by several other attributes. For example, in an address database, the zip code is determined by the city and street address. Formally, the functional dependence on the relational schema R can be expressed as: x → A, wherein
Figure BDA0002263651830000031
A ∈ R, X is called the left part, A is called the right part. The condition that a certain functional dependency holds or is valid on a certain instance R of the relation R is: for all tuples t, u belonging to r, if t [ B ]]=u[B]Then t [ A ]]=u[A]It can also be said that t and u agree on the attribute set X and the attribute a. If the value of A does not depend on any subset of X, then the functional dependence of X → A is minimal. For example, if Y → A does not hold on r for any subset Y of X, then the functional dependency of X → A is minimal. In addition, if attribute A does not belong to attribute set X, then the functional dependence of X → A is nontrivial. The data quality detection rules are found from the given data table, i.e. for a given relation r, all the most significant ones of r are found to be trueSmall, non-trivial function dependence.
2. Data quality detection rule mining algorithm
Among algorithms for mining the data quality detection rule, the TANE algorithm is more commonly used. The TANE algorithm is a hierarchy-based algorithm that searches an attribute set containing lattice and obtains k +1 attributes from the k attributes. It first searches the attribute set containing only single attribute, and then searches the attribute set containing multiple attributes layer by layer through the structure of attribute search grid. When the algorithm is processing a certain attribute set X, it will detect whether a dependency like X \ A } → A (attribute A belongs to attribute set X) holds, which ensures that only non-trivial function dependencies will be considered. The small-to-large direction can also ensure that only the minimum dependence can be output, thereby being beneficial to carrying out efficient pruning operation.
TANE finds the smallest and non-trivial function dependence through the attribute set inclusion lattice. In order to detect the minimum dependency of X \ a } → a that may satisfy the requirement, it is necessary to determine whether Y \ a } → a holds for some property subsets Y of the property set X, and this information is stored in the right-hand candidate set c (Y) of Y.
If for a given property set X, there is a property A belonging to set C (X), then A will not depend on any appropriate subset of X. More precisely, the set of preliminary rhs candidates for set X is:
Figure BDA0002263651830000032
wherein,
Figure BDA0002263651830000033
to find the minimum dependency, it is necessary to detect whether X \ A } → A is satisfied, and in X \ A } → A, the attribute A belongs to the attribute set X, and for all the attributes B belonging to the attribute set X, there is A ∈ C (X \ B }). It is to be noted that it is preferable that,while
Figure BDA0002263651830000042
Figure BDA0002263651830000044
The functional dependency X \ B } \ { A } → A formed by the attribute A in the set is not minimal, because if X \ A } → A holds true, one attribute B can be removed from the remaining set of the left part X \ A } so that the dependency X \ B } \ { A } → A holds true. And a minimum function dependency requirement A ∈ C (X \ B }), that is, X \ A } → A, if true, cannot remove any attribute B in the left part of the dependency, otherwise X \ A } → A will not be true.
3. Dijkstra algorithm
Dijkstra's algorithm was proposed in 1959 by the netherlands computer scientist dikstra, and is therefore also called the dikstra algorithm. The method is a shortest path algorithm from one vertex to the rest of the vertices, and solves the shortest path problem in the weighted graph. The Dijkstra algorithm is mainly characterized in that the Dijkstra algorithm expands outwards layer by taking a starting point as a center until the expansion reaches a terminal point.
The technical scheme of the invention is further explained by combining the attached drawings.
As shown in fig. 1, a rule reduction method based on a data quality detection rule mining result of the present invention includes the following steps:
s1, mining based on the data quality detection rule; the specific implementation method comprises the following steps: let relation mode be R, some example of R be R, attr (R) represents the set of all attributes of relation R, X is a certain attribute set on relation mode R, and A is a certain single attribute on relation mode R; let the expression of the mined data quality detection rule be FD: X → A, and find that FDs must satisfy the following two conditions:
minimum performance: means that if X → A holds, then Y → A does not hold for any one subset Y of X;
non-trivial: means that if X → A holds true, then attribute A does not belong to attribute set X;
step S1 includes the following substeps:
s11, scanning a database, and firstly, modeling all the antecedent values, namely the X set in X → A, on the relation R to obtain an attribute containing lattice, as shown in FIG. 2; in the search, the layer L composed of single attributes is first1Starting from L1Layer is obtained as2Layering until all attribute grids are searched; and L islThe number of attribute sets contained in the layer is l;
s12, if i is 1, it is the L-thiEach attribute set X of a layer computes a set C+(X):
Figure BDA0002263651830000045
S13, pruning: if it is not
Figure BDA0002263651830000046
Then for all supersets Y of X, there are
Figure BDA0002263651830000047
Since C (X) is empty, is composed of
Figure BDA0002263651830000048
In a clear view of the above, it is known that,
Figure BDA0002263651830000049
if not, then prove in the aggregate
Figure BDA00022636518300000410
The attribute A exists to ensure that X \ A } → A is established, the attribute set Y is a superset of the attribute set X, other attributes are added on the basis of X, and the dependence of X \ A } → A is also established; therefore, the dependence on the shape of Y \ A } → A also holds,
Figure BDA0002263651830000051
will not be empty, but this dependency is not minimal because there is a subset X of Y such that X \ A } → A holds, and because the TANE algorithm is to find that it is minimalSo there is no need to process set Y, so set Y is pruned;
s14, continuing to generate the Li+1The layer returns to step S12 by setting i to i + 1.
The TANE algorithm is based on the idea of partitioning attribute values of a row set, so that the effectiveness of function dependence can be quickly tested even if a large number of tuples are contained in the relation, and further the function dependence meeting the conditions existing in a data table can be quickly found out.
S2, carrying out rule reduction on the mining result, and comprising the following sub-steps:
s21, extracting all the attributes of the rules obtained in the step S1, and taking each attribute as a vertex of the graph G;
s22, drawing the edges of the graph G according to the dependency relationship among the attributes, and giving a weight 1 to each edge, as shown in FIG. 3;
s23, obtaining the shortest path of all the vertexes in the graph G by using a Dijkstra algorithm;
and S24, reducing the existing data quality detection rules according to the shortest path obtained by each vertex, and removing redundant rules to obtain a final data quality detection rule set.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (2)

1. The rule reduction method based on the data quality detection rule mining result is characterized by comprising the following steps of:
s1, mining based on the data quality detection rule;
s2, carrying out rule reduction on the mining result, and comprising the following sub-steps:
s21, extracting all the attributes of the rules obtained in the step S1, and taking each attribute as a vertex of the graph G;
s22, drawing the edges of the graph G according to the dependency relationship among the attributes, and assigning a weight 1 to each edge;
s23, obtaining the shortest path of all the vertexes in the graph G by using a Dijkstra algorithm;
and S24, reducing the existing data quality detection rules according to the shortest path obtained by each vertex, and removing redundant rules to obtain a final data quality detection rule set.
2. The method for reducing rules based on mining results of data quality detection rules according to claim 1, wherein the step S1 is implemented by: let relation mode be R, some example of R be R, attr (R) represents the set of all attributes of relation R, X is a certain attribute set on relation mode R, and A is a certain single attribute on relation mode R; let the expression of the mined data quality detection rule be FD: X → A, and find that FDs must satisfy the following two conditions:
minimum performance: means that if X → A holds, then Y → A does not hold for any one subset Y of X;
non-trivial: means that if X → A holds true, then attribute A does not belong to attribute set X;
step S1 includes the following substeps:
s11, scanning a database, and firstly, modeling all the prior values, namely the X set in X → A, on the relation R to obtain an attribute containing lattice; in the search, the layer L composed of single attributes is first1Starting from L1Layer is obtained as2Layering until all attribute grids are searched; and L islThe number of attribute sets contained in the layer is l;
s12, if i is 1, it is the L-thiEach attribute set X of a layer computes a set C+(X):
Figure FDA0002263651820000011
S13, pruning: if it is not
Figure FDA0002263651820000012
Then for all supersets Y of X, there are
Figure FDA0002263651820000013
Since C (X) is empty, is composed of
Figure FDA0002263651820000014
In a clear view of the above, it is known that,
Figure FDA0002263651820000015
if not, then prove in the aggregate
Figure FDA0002263651820000016
The attribute A exists to ensure that X \ A } → A is established, the attribute set Y is a superset of the attribute set X, other attributes are added on the basis of X, and the dependence of X \ A } → A is also established; therefore, the dependence on the shape of Y \ A } → A also holds,
Figure FDA0002263651820000017
will not be empty, but this dependency is not minimal, since there is a subset X of Y such that X \ a } → a holds, and since the smallest functional dependency is sought by the TANE algorithm, there is no need to process set Y, so set Y is pruned;
s14, continuing to generate the Li+1The layer returns to step S12 by setting i to i + 1.
CN201911079988.4A 2019-11-07 2019-11-07 Rule reduction method based on data quality detection rule mining result Pending CN110825788A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911079988.4A CN110825788A (en) 2019-11-07 2019-11-07 Rule reduction method based on data quality detection rule mining result

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911079988.4A CN110825788A (en) 2019-11-07 2019-11-07 Rule reduction method based on data quality detection rule mining result

Publications (1)

Publication Number Publication Date
CN110825788A true CN110825788A (en) 2020-02-21

Family

ID=69553235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911079988.4A Pending CN110825788A (en) 2019-11-07 2019-11-07 Rule reduction method based on data quality detection rule mining result

Country Status (1)

Country Link
CN (1) CN110825788A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994194A (en) * 2023-03-23 2023-04-21 河北东软软件有限公司 Method, system, equipment and medium for checking data quality of government affair big data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010051947A1 (en) * 2000-05-02 2001-12-13 International Business Machines Corporation Spatial data mining method, spatial data mining apparatus and storage medium
CN102163300A (en) * 2011-04-20 2011-08-24 南京航空航天大学 Method for optimizing fault diagnosis rules based on ant colony optimization algorithm
CN106294715A (en) * 2016-08-09 2017-01-04 中国地质大学(武汉) A kind of association rule mining method based on attribute reduction and device
CN109614491A (en) * 2018-12-21 2019-04-12 成都康赛信息技术有限公司 Further method for digging based on data quality checking rule digging result
CN110097934A (en) * 2018-12-29 2019-08-06 蚌埠医学院 A kind of attributive character reduction method of electrocardio Ontological concept

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010051947A1 (en) * 2000-05-02 2001-12-13 International Business Machines Corporation Spatial data mining method, spatial data mining apparatus and storage medium
CN102163300A (en) * 2011-04-20 2011-08-24 南京航空航天大学 Method for optimizing fault diagnosis rules based on ant colony optimization algorithm
CN106294715A (en) * 2016-08-09 2017-01-04 中国地质大学(武汉) A kind of association rule mining method based on attribute reduction and device
CN109614491A (en) * 2018-12-21 2019-04-12 成都康赛信息技术有限公司 Further method for digging based on data quality checking rule digging result
CN110097934A (en) * 2018-12-29 2019-08-06 蚌埠医学院 A kind of attributive character reduction method of electrocardio Ontological concept

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LEILEI CHANG等: "Structure learning for belief rule base expert system: A comparative study", 《ELSEVIER》 *
杨隆浩等: "基于关联系数标准差融合的置信规则库规则约简方法", 《信息与控制》 *
王雪珊: "数据质量校验规则提取技术的研究", 《中国优秀硕士学位论文全文数据库》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994194A (en) * 2023-03-23 2023-04-21 河北东软软件有限公司 Method, system, equipment and medium for checking data quality of government affair big data
CN115994194B (en) * 2023-03-23 2023-06-02 河北东软软件有限公司 Method, system, equipment and medium for checking data quality of government affair big data

Similar Documents

Publication Publication Date Title
KR101105363B1 (en) Finding Frequent Item Sets for Long Transaction Data Streams
CN107229751A (en) A kind of concurrent incremental formula association rule mining method towards stream data
CN110069500B (en) Dynamic mixed indexing method for non-relational database
Min et al. Symmetric continuous subgraph matching with bidirectional dynamic programming
CN105117488B (en) A kind of distributed storage RDF data balanced division method based on hybrid hierarchy cluster
CN109635037B (en) Shard storage method and device for a relational distributed database
CN104408584A (en) Analysis method and system for transaction relevance
CN112925789B (en) Spark-based space vector data memory storage query method and system
CN112699134A (en) Distributed graph database storage and query method based on graph subdivision
CN116521956A (en) A graph database query method, device, electronic equipment and storage medium
Dhulipala et al. Terahac: Hierarchical agglomerative clustering of trillion-edge graphs
Cao et al. An improved method to build the KD tree based on presorted results
CN102043857B (en) All-nearest-neighbor query method and system
CN109697206A (en) A kind of distributive function dependence method for digging
CN110825788A (en) Rule reduction method based on data quality detection rule mining result
Naresh et al. Implementation of dynamic and fast mining algorithms on incremental datasets to discover qualitative rules
CN115114464A (en) Power grid graph database storage method based on multi-Hash algorithm
Lasek et al. Density-based clustering with constraints
Lin et al. Mining of high average-utility patterns with item-level thresholds
CN111107493B (en) Method and system for predicting position of mobile user
CN105740371A (en) Density-based incremental clustering data mining method and system
CN117370613A (en) Method and device for accelerating shortest path for large-scale graph data
Cao et al. Research on searching algorithms for unstructured grid remapping based on KD tree
CN110807061A (en) Method for searching frequent subgraphs of uncertain graphs based on layering
CN114116785A (en) A Distributed SPARQL Query Optimization Method Based on Minimum Attribute Cut

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200221

RJ01 Rejection of invention patent application after publication