CN102495834A

CN102495834A - Incremental data cleaning method based on memory mapping

Info

Publication number: CN102495834A
Application number: CN2011103169145A
Authority: CN
Inventors: 唐雪飞; 陈科; 汪海良; 李应洪
Original assignee: UESTC COMSYS INFORMATION CO Ltd
Current assignee: UESTC COMSYS INFORMATION CO Ltd
Priority date: 2011-10-18
Filing date: 2011-10-18
Publication date: 2012-06-13

Abstract

The invention discloses an incremental data cleaning method based on memory mapping. Data of a data source to be cleaned are loaded into a memory, and a memory MDB (meta database)-tree is built by the aid of a HASH list; quick data inquiry is completed by the aid of a top-down strategy, and operations of data insertion, deletion and the like are completed on the basis; views in the memory mapping are standardized into AUSPJ (aggregation, union, selection, projection and join) sections and D (differential) sections, and incremental maintenance of nodes is realized by the aid of relational algebraic operation; and finally data cleaning operation is realized by the aid of a result collection merging algorithmic method. The incremental data cleaning method can be widely applied to industries and fields with high data quantities and data quality requirements, such as education, telecommunication, governments and the like.

Description

Incremental data cleaning method based on memory mapping

Technical field

The invention belongs to computerized information analysis and data processing field, be specifically related to a kind of incremental data cleaning method based on memory mapping.

Background technology

When setting up an infosystem,, can not guarantee in all cases that the quality of institute's store data can both satisfy user's requirement even carried out good design and planning.User's typing mistake, combination of enterprise and corporate environment in time passing and change, these all can influence the quality of institute's store data.Therefore, be necessary to represent the quality of data with metadata.The consistance of data (consistency), correctness (correctness), integrality (completeness) and four indexs of minimality (minimality) are used for the degree of declarative data.The data cleansing process is exactly to solve identical concept is had different method for expressing, when integrated a plurality of data source, eliminates problems such as pattern conflict and duplicated records.

Nowadays many this cleaning meanss and data cleansing technology is arranged on the market; But great majority are that data are placed on the external disk; In case and data pollute to some extent must whole table in addition the entire database data warehouse clean totally, so just cause cleaning efficiency very low.In order to overcome the drawback that traditional data is cleaned; The present invention has designed a kind of brand-new data cleansing method; Make full use of the fireballing advantage of internal storage data library inquiry, increase and deletion; The data that will operate are written into internal memory and set up MDB-tree index, solve problems such as buffer memory sensitivity and TLB inefficacy, improve the Data Matching rate.The present invention simultaneously uses the incremental data cleaning method, utilizes relational algebra operation to realize the incremental maintenance of node, thereby avoids full dose attended operation consuming time.Prior art involved in the present invention comprises:

1, memory database (MDB) index technology

Tradition is based on the relational database system (DRDB) of disk; Because MDL disk resident; Issued transaction often relates to the disk I operation; The optimization aim of its architecture Design is how to reduce the number of times of read-write disk, is difficult to satisfy the demand of following based on network application system to the high-performance data access ability.The data of memory database (MDB) forever reside in the internal memory.MDB has the access efficiency higher than DRDB when visit data.Along with DRAM price constantly reduces, memory size enlarges, and the increasing database of storage becomes feasible in internal memory.

The index structure that DRDB adopts mainly is the B/B+ tree, and its design object is exactly to reduce IO number of visit data in magnetic disk.And in MDB, having adopted a kind of new index structure T tree, its design object is to reduce memory cost and cpu instruction number.The T tree is that it is the binary tree that an a kind of node comprises a plurality of elements by Adelson-Velskii-Landis tree and the development of B tree.Owing to be binary tree; The T tree has kept the high-level efficiency of Adelson-Velskii-Landis tree binary chop; Simultaneously a T node comprises a plurality of elements, as the B tree, the full level of each node remain on half-full and full up between; Move usually and only need carry out inserting and delete caused data at an intranodal, reduced for the balance that keeps tree the rotary manipulation that must carry out.Because all in internal memory, it is right in a T tree node, need as B sets, not deposit N index key assignments-pointer for index and data, only need deposit the pointer that points to respective record corresponding field in the internal memory, the storage of variable-length field no longer is a problem in the index like this.But along with the understanding of processor cache effect is deepened gradually, how the data access technology hot research direction of MDB makes good use of processor cache in access process, make performance can obtain largest optimization.And the buffer memory utilization factor of T tree is very poor, has restricted the excavation of performance potential on the contrary.Thereby people have developed the CSS tree, the MDB-tree of using among the responsive index technology of buffer memorys such as CSB+ tree and the present invention in succession.

2, data increment maintenance technology

In to data depot data cleaning process, need two kinds of ETL processes: full dose ETL process and increment ETL process, full dose ETL process is used for the initialization of data warehouse, and increment ETL process then is used for constantly new data being written into data warehouse.And the method for using the Materialized View incremental maintenance can utilize full dose ETL process to give birth to increment ETL process from movable property.But Materialized View proposes in order to improve the data warehouse search efficiency at first, and incremental maintenance is for effective Materialized View that upgrades, thereby the research that has the Materialized View incremental maintenance concentrates on data warehouse environment.Data in the data warehouse are all pressed the theme tissue; Use polymerization, selection, projection, connection computing in the inquiry in a large number; And seldom use difference operation, but often using difference operation in the ETL process rejects unwanted data, this be with data warehouse environment under Materialized View safeguard different places.

The data increment maintenance technology is at first standardized to the relational algebra operation node in the ETL process; Respectively the relational algebra operation node after the standardization is carried out incremental maintenance then; Mainly consider in the present invention 6 kinds of relational algebra operations (polymerized alpha, select σ, projection π, connection

and U, poor-), standard be AUSPJ (polymerization, also, selection, projection with connect) combination of fragment and D (poor) fragment.Provide the incremental maintenance method of these two kinds of fragments simultaneously.

3, result set merger technology

About the merger of result set is one of the basis that can normally use of this method.When returning Query Result, in fact the result set with all a plurality of thread enquiry modules carries out merger, before all Query Results arrive the homophony module, need do elementary ordering.Arrive after the homophony module through merger the result set of each module is carried out a merger.

Summary of the invention

In order to overcome the above-mentioned shortcoming of prior art; The invention provides a kind of incremental data cleaning method, made full use of the big advantage of computer nowadays internal memory, data to be cleaned are loaded into internal memory based on memory mapping; Set up MDB-tree index structure efficiently; Improve data access speed greatly, utilize the increment cleaning way, improve data cleansing efficient.

The technical solution adopted for the present invention to solve the technical problems is: a kind of incremental data cleaning method based on memory mapping comprises the steps:

The first step, at first data are loaded into internal memory, make up memory database MDB-tree index structure, after having set up MDB-tree index structure, utilize the crash rate of crash rate and the TLB of secondary cache to set up the cost model this tree construction is assessed;

Second goes on foot, utilizes the MDB-tree to carry out inquiry, insertion and the deletion action of data;

The 3rd step, with memory database Materialized View standard turn to AUSPJ (polymerization, also, selection, projection with connect) fragment and D (poor) fragment, the incremental maintenance of employing relational algebra operation realization node;

The 4th step, employing result set conflation algorithm are realized the data cleansing operation.

In said evaluation process:

1) calculate the quantity that the cache of the leaf node of MDB-tree lost efficacy by following formula:

Wherein: n is available node number, and q is the number of data item to be matched, and h is the height of tree, n _cBe the quantity of the Cache Line of a node leap, u is the average utilization of node or barrel chain;

2) calculate the number that TLB lost efficacy by following formula:

{TLB}_{m} = \frac{C_{m} - 1}{\log_{2} (u * n_{c}) + 0.5} + 1 .

In the query manipulation process of said data, the inter-node search algorithm is carried out as follows:

1) according to node pointer of importing and input key assignments; P.key in the node of this node pointed and input key assignments key are compared; Obtain the scope of key between this junction area; Judge that then this interval points to whether the pointer of child node is empty, if for empty then in the child node of this node, continue to search; If be sky then represent that this node is a leaf node;

2) use the hash function that key is done the hash computing and obtain the hash address; According to searching in the HASH table of hash address in leaf node; If do not find then search failure, if a plurality of index are arranged, then to search the key value successively in the barrel chain that is head pointer of this HASH address; If have then the index address that this key value is corresponding returns, if do not exist then search failure.

In the insertion of said data, deletion action process, the pairing leaf node of KEY that first inquiry will be inserted or delete is after leaf node is positioned; For inserting operation; Judge whether this node has enough spaces to hold new index entry, if insufficient space then divides leaf node; For deletion action, if delete after this index entry, there are not other index entries in this leaf node, then readjusts the MDB-tree.

Said employing relational algebra operation realizes that the incremental maintenance of node comprises the steps:

1) an ETL process specification that includes five kinds of fundamental relation computings and polymerization computing is turned to the combination of AUSPJ fragment and D fragment; The standardization result of ETL process is: a D fragment only comprises single difference operation; An AUSPJ fragment comprises five kinds of fundamental relation computings; And must satisfy one of following four kinds of situation: if a) the polymerization computing is arranged, then the polymerization computing must be in the top of AUSPJ fragment; B) the AUSPJ fragment is followed in D fragment back; C) the AUSPJ fragment is followed and is being connected the computing back, and union is then in the top of AUSPJ fragment; D) the AUSPJ fragment is a uppermost fragment in the ETL process;

2) two kinds of fragments are realized incremental maintenance:

Incremental maintenance to the AUSPJ fragment: must safeguard following three kinds of Materialized Views: all input relations that a) connect computing; B) output relation of project and union, and this relation will increase count attribute or allow to preserve repeating data; C) output relation of polymerization computing;

Incremental maintenance to the D fragment: its input relation must be a Materialized View, adopts SRA method or BRA method to realize.

Said BRA method is meant that for Materialized View C=A-B if satisfy one of following three kinds of situation, then C is from safeguarding:

A) A has only deletion increment and B to have only the insertion increment;

B) A has only deletion increment and B not to have increment;

C) A does not have increment and B to have only the insertion increment.

Said result set conflation algorithm is meant according to the result of each thread inquiry, and sub-result set is merged, and finally becomes complete Query Result.

Compared with prior art, good effect of the present invention is: the present invention all is written into internal memory with data and sets up memory database, sets up MDB-tree index structure, improves data search, deletion, insertion speed and efficient greatly.Help reducing data cleansing process design cost.It is big to can be applicable to data volume, industry and the field high to the data quality requirements.

Description of drawings

The present invention will explain through example and with reference to the mode of accompanying drawing, wherein:

Fig. 1 is a MDB-tree index structure;

Fig. 2 is a MDB-tree search index process flow diagram;

Fig. 3 is the definition figure of AUSPJ fragment;

Fig. 4 is the AUSPJ segment of incremental maintenance;

Fig. 5 is the parallel query merger process as a result synoptic diagram of 8 threads.

Embodiment

A kind of incremental data cleaning method based on memory mapping comprises the steps:

The first step, database all is written into internal memory, selects need carry out in the memory database table of index, the key assignments that needs the row of index in this table is set up MDB-set index structure:

At first be to set up similar B+ tree: set up rule according to the B+ tree key word is inserted in this tree, each node is not stored data, a storage key, and each intranodal key word is in order and comprise the pointer of number of keyword+1 a sensing child node.

All key words are after having set up the B+ tree; Leaf node is filled with the HASH table: the hash value of each key word being selected the hash function calculation key word of low calculated amount; Then key word and corresponding data are recorded in address in the internal memory and are put in the HASH table and go, just this hash result is stored in the barrel chain afterbody that starts with this hash address if the HASH address produces conflict.Set threshold values and will cause the leaf node division when the length of hash barrel chain surpasses.

As shown in Figure 1, MDB-tree index structure comprises two-layer, and ground floor is the same as common B+ tree, and it is the tree structure of a multilayer multiplexing.The second layer is each leaf node of this tree.Each leaf node is the mode through HASH all, data item key in the database is hashed in this HASH hash table leaf node go.

After having set up MDB-tree index structure, utilizing the crash rate of crash rate and the TLB of secondary cache to set up the cost model assesses this tree construction.

The quantity that the cache of the leaf node of in evaluation process, setting by following formula calculating MDB-lost efficacy:

C_{m} = \frac{Σ_{i = 1}^{h} X (n (i), q) * (\log_{2} (u * n_{c}) + 0.5)}{q}

Wherein: n is available node number, and q is the number of data item to be matched, and h is the height of tree, n _cBe the quantity of the Cache Line of a node leap.U is the average utilization of node or barrel chain.

Calculate the number that TLB lost efficacy by following formula:

{TLB}_{m} = \frac{C_{m} - 1}{\log_{2} (u * n_{c}) + 0.5} + 1

Second goes on foot, utilizes the MDB-tree to carry out inquiry, insertion and the deletion action of data:

1) query manipulation of data: search algorithm is substantially similar with the B+ tree, is top-downly to begin to search according to KEY the path of child node from root node, in leaf node, searches index entry at last.Obtaining under the pointer situation of certain node, the inter-node search algorithm is carried out as follows:

A) according to node pointer of importing and input key assignments; P.key in the node of this node pointed and input key assignments key are compared; Obtain the scope of key between this junction area; Judge that then this interval points to whether the pointer of child node is empty, if for empty then in the child node of this node, continue to search; If be sky then represent that this node is a leaf node;

B) use the hash function that key is done the hash computing and obtain the hash address; According to searching in the HASH table of hash address in leaf node, if do not find then search failure, if a plurality of index are arranged; Then to search the key value successively in the barrel chain that is head pointer of this HASH address; If have then the index address that this key value is corresponding returns, if do not exist then search failure, it is as shown in Figure 2 to search flow process;

2) insertion of data, deletion action:

The pairing leaf node of KEY that needs inquiry to insert or to delete after leaf node is positioned, for inserting operation, needs to judge whether this node has enough spaces to hold new index entry, if insufficient space needs the division leaf node; For deletion action,, need readjust the MDB-tree so if there are not other index entries in this leaf node after needing to judge this index entry of deletion.

The 3rd step, with memory database Materialized View standard turn to AUSPJ (polymerization, also, selection, projection with connect) fragment and D (poor) fragment, the incremental maintenance of employing relational algebra operation realization node:

Adopt relational algebra operation to realize that the incremental maintenance method of node comprises the steps:

1) an ETL process specification that includes 5 kinds of fundamental relation computings (select σ, projection π, connection

and U, poor-) and polymerization computing α is turned to the combination of two kinds of fragments (AUSPJ fragment and D fragment); The standardization result of ETL process is: a D fragment only comprises single difference operation; An AUSPJ fragment comprises other 5 kinds of computings; And with

series arrangement; An AUSPJ fragment might not comprise all 5 kinds of computings, but must satisfy one of following four kinds of situation:

If a) polymerization computing α is arranged, then polymerization computing α must be in the top of AUSPJ fragment;

B) the AUSPJ fragment is followed in D fragment back;

C) the AUSPJ fragment is followed and is being connected computing

at the back, and union U is then in the top of AUSPJ fragment;

D) the AUSPJ fragment is a uppermost fragment in the ETL process, promptly on it, has no other fragments.

This normalization method is based on following equality:

With moving on the U: π _A(R ∪ S)=π _A(R) ∪ π _A(S), σ _C(R ∪ S)=σ _C(R) ∪ σ _C(S)

With moving on the π:

With moving on the σ:

It should be noted that; The equality of AUSPJ fragment of being used for standardizing does not use equality

with moving on the U; This is will be divided into 2 or more because handle the single connection computing in back like this, can reduce execution efficient.

2) two kinds of fragments are realized incremental maintenance:

Incremental maintenance to the AUSPJ fragment: must safeguard following three kinds of Materialized Views: all input relations that a) connect computing; B) output relation of project and union, and this relation will increase count attribute or allow to preserve repeating data; C) output relation of polymerization computing.For the ease of understanding, explain with an example.

Consider the complete AUSPJ fragment of a shape like

; Wherein G representes the packet attributes set; B representes to participate in the community set of polymerization computing; And Ri has increment Delta Ri and

(1≤i≤3) arbitrarily, and the definition of then whole AUSPJ fragment is as shown in Figure 3.

Introduce the incremental maintenance method that how to realize below:

At first be to connect computing

According to following formula, obtain input and concern R ₁, R ₂Need materialization:

Selection computing σ afterwards _C, according to following formula, do not need any relation of materialization:

ΔR ₀＝σ _C(ΔR ₁)

&dtri; R_{0} = σ_{C} (&dtri; R_{1})

Concern Rpu after the public same increase count attribute of the project π of followed and union U.And before the polymerization computing α that Rpu is sent into the back, need a project π to remove count attribute; And because the usage count attribute has guaranteed that the tuple in the relation after the projection is inevitable unique; According to following formula, do not need any relation of materialization, directly output data is sent into polymerization computing α:

ΔR ₀＝π _A(ΔR ₁)

&dtri; R_{0} = π_{A} (&dtri; R_{1})

The output relation of polymerization computing α needs materialization, and if polymerization computing α asks maximum, minimum value, then need to import to concern materialization.The AUSPJ segment that is used for incremental maintenance is as shown in Figure 4.

Incremental maintenance to the D fragment: its input relation must be a Materialized View, generally adopts SRA (split relation access) method or BRA (base relation access) method to realize; The BRA method refers to for Materialized View C=A-B, if satisfy one of following three kinds of situation, then C is from safeguarding:

A) A has only deletion increment and B to have only the insertion increment;

B) A has only deletion increment and B not to have increment;

C) A does not have increment and B to have only the insertion increment.

The 4th step, employing result set conflation algorithm are realized the data cleansing operation:

Result according to each thread inquiry merges sub-result set, finally becomes complete Query Result.About the merger of result set is one of the basis that can normally use of this method.When returning Query Result, in fact the result set with all a plurality of thread enquiry modules carries out merger, before all Query Results arrive the homophony module, need do elementary ordering.Arrive after the homophony module through merger the result set of each module is carried out a merger.Fig. 5 is the parallel query merger process as a result synoptic diagram of 8 thread inquiries.

Claims

1. the incremental data cleaning method based on memory mapping is characterized in that: comprise the steps:

The first step, at first data are loaded into internal memory, make up memory database MDB-tree index structure, after having set up MDB-tree index structure, utilize the crash rate of crash rate and the TLB of secondary cache to set up the cost model this tree construction is assessed.

2. the incremental data cleaning method based on memory mapping according to claim 1 is characterized in that: in said evaluation process:

2) calculate the number that TLB lost efficacy by following formula:

{TLB}_{m} = \frac{C_{m} - 1}{\log_{2} (u * n_{c}) + 0.5} + 1 .

3. the incremental data cleaning method based on memory mapping according to claim 1 is characterized in that: in the query manipulation process of said data, the inter-node search algorithm is carried out as follows:

4. the incremental data cleaning method based on memory mapping according to claim 1; It is characterized in that: in the insertion of said data, deletion action process, the pairing leaf node of KEY that first inquiry will be inserted or delete is after leaf node is positioned; For inserting operation; Judge whether this node has enough spaces to hold new index entry, if insufficient space then divides leaf node; For deletion action, if delete after this index entry, there are not other index entries in this leaf node, then readjusts the MDB-tree.

5. the incremental data cleaning method based on memory mapping according to claim 1 is characterized in that: said employing relational algebra operation realizes that the incremental maintenance of node comprises the steps:

2) two kinds of fragments are realized incremental maintenance:

6. the incremental data cleaning method based on memory mapping according to claim 5 is characterized in that: said BRA method is meant that for Materialized View C=A-B if satisfy one of following three kinds of situation, then C is from safeguarding:

A) A has only deletion increment and B to have only the insertion increment;

B) A has only deletion increment and B not to have increment;

C) A does not have increment and B to have only the insertion increment.

7. the incremental data cleaning method based on memory mapping according to claim 1 is characterized in that: said result set conflation algorithm is meant according to the result of each thread inquiry, and sub-result set is merged, and finally becomes complete Query Result.