CN108776697A

CN108776697A - A kind of multi-source data collection cleaning method based on predicate

Info

Publication number: CN108776697A
Application number: CN201810578708.3A
Authority: CN
Inventors: 谢子哲; 李论; 刘奇志
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2018-06-06
Filing date: 2018-06-06
Publication date: 2018-11-09
Anticipated expiration: 2038-06-06
Also published as: CN108776697B

Abstract

The present invention proposes that a kind of method that the multi-source data collection cleaning method based on predicate is provided effectively can identify most reliable data item from isomorphism multi-source data concentration, is related to the fields such as data cleansing, data fusion.The method includes：1) predicate is excavated with automated method, and the predicate excavated is filtered；2) confidence level of each entity attributes value is concentrated according to predicate derived data；3) attribute value confidence level is established with the relationship between data source confidence level, calculates data source confidence level；4) data source confidence level and attribute value confidence level is combined to find out the highest data item of confidence level.For multiple data sources, the present invention can filter out redundancy, mistake and out-of-date data, leave the highest data of confidence level to coming from different data sources but the identical information of content is analyzed, for subsequent data analysis lay a good foundation, it is of great significance to the efficiency and accuracy rate of follow-up data processing.

Description

A predicate-based cleaning method for multi-source datasets

技术领域technical field

本发明涉及数据清洗、数据融合等领域，尤其是一种基于谓词的多源数据集清洗方法。The invention relates to the fields of data cleaning, data fusion and the like, in particular to a predicate-based multi-source data set cleaning method.

背景技术Background technique

在信息时代，可以从大量的数据源中找到对同一个事件或者物体的描述数据，同时由于时间错误、格式错误、精确度、完整性、语义上的歧义等原因，来自不同数据源对同一实体的描述存在不一致性。在从不同数据源搜集数据后，解决属于同一实体的描述数据之间的不一致性，对后续的数据分析至关重要。简单的投票策略——选择较多数据源支持的描述——并不适用于当下Web环境，而需要考虑数据源可信度、数据本身的可信度以及一些先验知识来设计更复杂的清洗策略。现有的清洗策略主要包括以下几种：In the information age, description data for the same event or object can be found from a large number of data sources. At the same time, due to time errors, format errors, accuracy, completeness, semantic ambiguity, etc., data for the same entity from different data sources There is an inconsistency in the description of . After collecting data from different data sources, resolving inconsistencies between descriptive data belonging to the same entity is crucial for subsequent data analysis. A simple voting strategy—selecting descriptions supported by more data sources—is not suitable for the current Web environment, and it is necessary to consider the credibility of data sources, the credibility of the data itself, and some prior knowledge to design more complex cleaning Strategy. Existing cleaning strategies mainly include the following:

中国专利201410387772号申请文件公开了“一种基于交通多源数据融合的公交路况处理系统及方法”，它融合来自不同数据源的描述公交路况的交通数据得到可供展示的路况信息。它的输入为特定交通数据，没有根据谓词进行可信度判断，也没有根据数据和数据源之间的关系计算数据源的可信度。The application document of Chinese Patent No. 201410387772 discloses "a bus traffic condition processing system and method based on traffic multi-source data fusion", which integrates traffic data describing bus traffic conditions from different data sources to obtain road condition information for display. Its input is specific traffic data, and it does not judge the credibility according to the predicate, nor calculate the credibility of the data source according to the relationship between the data and the data source.

中国专利201110369877号申请文件公开了“一种多源数据集成平台及其构建方法”，它是对不同的数据进行管理，这些数据之间不存在一致性问题。Chinese patent application document No. 201110369877 discloses "a multi-source data integration platform and its construction method", which manages different data, and there is no consistency problem among these data.

美国专利US 8190546号申请文件公开了“Dependency between sources intruth discovery”，它通过数据源之间的拷贝关系建立概率图模型来评估数据源和数据的可信度，并不涉及用谓词来评估数据的可信度。The US Patent No. US 8190546 application document discloses "Dependency between sources intruth discovery", which establishes a probabilistic graph model through the copy relationship between data sources to evaluate the credibility of data sources and data, and does not involve the use of predicates to evaluate data credibility.

发明内容Contents of the invention

发明目的：为了克服目前在多源数据融合中，描述相同实体的数据不一致的问题，也就是多源数据一致性问题中难以确定数据可信度初始值，以及如何结合数据源可信度和数据可信度的问题，本发明提供一种基于数据源可信度和数据可信度的多源数据集清洗方法，通过设定谓词计算数据可信度，再通过数据可信度计算数据源可信度，最终找出可信度最高的数据，达到数据清洗的目的。Purpose of the invention: In order to overcome the current problem of inconsistency in describing the data of the same entity in multi-source data fusion, that is, it is difficult to determine the initial value of data credibility in the problem of multi-source data consistency, and how to combine data source credibility and data To solve the problem of credibility, the present invention provides a method for cleaning multi-source datasets based on data source credibility and data credibility. The data credibility is calculated by setting predicates, and then the data source can be calculated through the data credibility. Reliability, and finally find the data with the highest reliability to achieve the purpose of data cleaning.

技术方案：为实现上述技术效果，本发明提出一种基于谓词的多源数据集清洗方法，包括步骤：Technical solution: In order to achieve the above technical effects, the present invention proposes a predicate-based multi-source data set cleaning method, including steps:

(1)构建谓词模型：定义优先级谓词、状态谓词和交互谓词；其中，(1) Build a predicate model: define priority predicates, status predicates and interaction predicates; among them,

优先级谓词为Prior(A_i，A_j)，表示属性A_i的优先级高于属性A_j的优先级；The priority predicate is Prior(A _i , A _j ), which means that the priority of attribute A _i is higher than that of attribute A _j ;

状态谓词为：其中，t_i表示语句i，表示语句i中属性A_k的属性值，表示预定义的与之间满足的条件，φ(t_i，t_j)表示预定义的t_i与t_j之间满足的条件；Stat(A_k)表示当t_i和t_j满足条件P和φ时，t_i的质量高于t_j；The status predicates are: Among them, t _i represents statement i, Indicates the attribute value of attribute A _k in sentence i, means predefined and φ(t _i , t _j ) represents the condition satisfied between the predefined t _i and t _j ; Stat(A _k ) represents when t _i and t _j satisfy the conditions P and φ, t _i has a mass higher than t _j ;

交互谓词为：Inter_δ(A₁，…，A_l)，表示当数据满足条件δ时，该条数据的属性A₁，…，A_l的属性值质量差；The interaction predicate is: Inter _δ (A ₁ ,...,A _l ), which means that when the data satisfies the condition δ, the quality of the attribute value of the attribute A ₁ ,...,A _l of the piece of data is poor;

(2)通过步骤(1)定义的谓词模型对待清洗的数据集进行谓词挖掘，得到数据集中的优先级谓词、状态谓词和交互谓词；(2) Perform predicate mining on the data set to be cleaned through the predicate model defined in step (1), and obtain priority predicates, status predicates and interaction predicates in the data set;

(3)根据得到的谓词推导数据集中各数据的属性值可信度，包括步骤：(3) Deduce the attribute value credibility of each data in the data set according to the obtained predicate, including steps:

(3-1)初始化数据集中数据的所有属性值可信度为0，并为每一条数据的各属性值设置影响因子η，η为一个常数；(3-1) All attribute value credibility of the data in the initialization data set is 0, and influence factor η is set for each attribute value of each piece of data, and η is a constant;

(3-2)运用状态谓词和交互谓词更新每条数据各属性值的可信度，更新时，先运用状态谓词更新再运用交互谓词更新，或先运用交互谓词更新再运用状态谓词更新；(3-2) Use state predicates and interaction predicates to update the credibility of each attribute value of each piece of data. When updating, first use the state predicates to update and then use the interaction predicates to update, or first use the interaction predicates to update and then use the state predicates to update;

运用状态谓词更新数据各属性值的可信度的步骤为：两两枚举数据集中的两条数据t_i和t_j，如果t_i和t_j在属性A_k上满足状态谓词：则将属性值的可信度减去η；The steps to update the credibility of each attribute value of the data using the state predicate are as follows: enumerate two data t _i and t _j in the data set in pairs, if t _i and t _j satisfy the state predicate on the attribute A _k : then attribute value The credibility minus η;

运用交互谓词更新数据各属性值的可信度的步骤为：遍历数据集中的所有数据，如果一条数据满足某个交互谓词Inter_δ(A₁，…，A_l)，则将该条数据属性A₁，…，A_l的属性值的可信度减去η；The steps to use the interaction predicate to update the credibility of each attribute value of the data are: traverse all the data in the data set, if a piece of data satisfies a certain interaction predicate Inter _δ (A ₁ ,...,A _l ), then the data attribute A ₁ ,... _, the credibility of the attribute value of Al minus η;

(3-3)在步骤(2)完成后，运用优先级谓词更新每条数据的属性值可信度，更新时，按照优先级从高到低的顺序依次执行优先级谓词；(3-3) After the step (2) is completed, use the priority predicate to update the attribute value credibility of each piece of data. When updating, execute the priority predicate in order of priority from high to low;

执行优先级谓词Prior(A_i，A_j)的步骤为：若多条数据在属性A_j上的属性值的可信度相同，则将它们按照A_i的属性值可信度做升序排序，按照排序后的顺序，在排在第n位的数据的A_j的属性值可信度上加上n-1；The steps of executing the priority predicate Prior(A _i , A _j ) are: if the attribute values of multiple pieces of data on attribute A _j have the same credibility, sort them in ascending order according to the attribute value credibility of A _i , According to the sorted order, add n-1 to the attribute value reliability of A _j of the nth-ranked data;

(3-4)得到所有属性值的可信度后，对于多值属性，返回所有可信度大于等于预设阈值的属性值作为结果；对于只需要返回一个结果的属性，执行步骤(4)至(6)；(3-4) After obtaining the credibility of all attribute values, for multi-valued attributes, return all attribute values with credibility greater than or equal to the preset threshold as the result; for attributes that only need to return one result, perform step (4) to (6);

(4)将所有属性值的可信度进行归一化；根据公式计算待清洗数据集中所有数据源的可信度；其中，λ_i表示数据源D_i的可信度，t表示数据源D_i中的一条数据，d(t)表示数据t的可信度，数据t的可信度等于该条数据所有属性值可信度之和；(4) Normalize the credibility of all attribute values; according to the formula Calculate the credibility of all data sources in the data set to be cleaned; where, λ _i represents the credibility of data source D _i , t represents a piece of data in data source D _i , d(t) represents the credibility of data t, The credibility of data t is equal to the sum of the credibility of all attribute values of the data;

(5)根据公式更新各属性值的可信度，D′表示对于属性A_j提供属性值的数据源；更新后返回步骤(4)；(5) According to the formula Update the credibility of each attribute value, D' means that attribute value is provided for attribute A _j The data source; after updating, return to step (4);

(6)重复执行步骤(4)至(5)，直至所有属性值的可信度收敛；对于只需返回一个结果的属性，找出该属性下可信度最高的属性值为最终结果。(6) Repeat steps (4) to (5) until the credibility of all attribute values converges; for an attribute that only needs to return one result, find out the attribute value with the highest reliability under this attribute as the final result.

进一步的，所述优先级谓词的定义方法为：对于属性A_i和A_j，若满足p_score(A_i)＜p_score(A_j)，则定义优先级谓词Prior(A_i，A_j)，表示属性A_i的优先级p_score(A_i)高于属性A_j的优先级p_score(A_j)；其中，H(A_i)表示属性A_i的香农熵，p_n(A_i)表示属性A_i的所有属性值中null值的比例。Further, the definition method of the priority predicate is: for the attributes A _i and A _j , if p _score (A _i )<p _score (A _j ), then define the priority predicate Prior(A _i , A _j ) , indicating that the priority p _score (A _i ) of attribute A _i is higher than the priority p _score (A _j ) of attribute A _j ; Among them, H(A _i ) represents the Shannon entropy of attribute A _i , and p _n (A _i ) represents the proportion of null values in all attribute values of attribute A _i .

进一步的，所述状态谓词和交互谓词均通过一阶逻辑谓词挖掘方法获得。Further, both the state predicate and the interaction predicate are obtained through a first-order logic predicate mining method.

进一步的，在对数据集进行清洗前，对于所有的数据集的所有属性进行人工标记，标记每个属性需要返回一个结果还是多个结果，如果一个属性只需返回一个结果，则标记该属性为单值属性，清洗时将该属性下可信度最高的属性值为最终结果；如果一个属性可能存在多个结果，则标记该属性为多值属性，清洗时将该属性下可信度大于预设阈值的所有属性值为最终结果。Further, before cleaning the data set, manually mark all the attributes of all the data sets, and mark whether each attribute needs to return one result or multiple results. If an attribute only needs to return one result, mark the attribute as For a single-valued attribute, the value of the attribute with the highest reliability under the attribute is the final result during cleaning; if there may be multiple results for an attribute, mark the attribute as a multi-valued attribute, and the attribute under the attribute with a reliability greater than the expected value is selected during cleaning. Set all attributes of Threshold to the final result.

有益效果：与现有技术相比，本发明具有以下优势：Beneficial effect: compared with the prior art, the present invention has the following advantages:

无需假定一个属性仅有一个正确值存在，也不依赖于众包，无需大量人工干预，利用自动挖掘出的谓词和数据集与属性值之间的关系找出可信度高的属性值。本发明通过挖掘自定义谓词来对属性值的可信度打分，对于多答案属性找出可信度高于预设阈值的属性值作为结果，对于剩下属性，结合数据源可信度和属性值可信度的关系进一步更新属性值可信度，找到可信度最高的属性值作为结果，对提高数据分析的效率和数据分析的精确度具有重要意义。采用本发明的技术方案，工程人员可以比较容易地实现相关软件。There is no need to assume that there is only one correct value for an attribute, nor does it depend on crowdsourcing, and it does not require a lot of manual intervention. It uses automatically mined predicates and the relationship between data sets and attribute values to find highly credible attribute values. The present invention scores the credibility of attribute values by digging self-defined predicates, finds attribute values whose credibility is higher than the preset threshold for multi-answer attributes as a result, and combines data source credibility and attribute values for the remaining attributes The relationship between value credibility further updates the attribute value credibility, and finding the attribute value with the highest credibility as a result is of great significance for improving the efficiency and accuracy of data analysis. By adopting the technical scheme of the invention, engineering personnel can realize related software relatively easily.

附图说明Description of drawings

图1为本发明的流程图；Fig. 1 is a flowchart of the present invention;

图2为本发明中更新数据源中属性值可信度的计算流程示意图。Fig. 2 is a schematic flow chart of calculating the credibility of attribute values in the updated data source in the present invention.

具体实施方式Detailed ways

下面结合附图对本发明作更进一步的说明。The present invention will be further described below in conjunction with the accompanying drawings.

图1所示为本发明的流程图，本发明主要包括以下几个部分：Shown in Fig. 1 is flow chart of the present invention, and the present invention mainly comprises following several parts:

a)首先定义三种谓词：a) First define three predicates:

1)优先级谓词：对于属性A_i和A_j，如果p_score(A_i)＜p_score(A_j)，则定义一个优先级谓词Prior(A_i，A_j)，表示属性A_i的优先级p_score(A_i)高于属性A_j的优先级p_score(A_j)；其中，H(A_i)表示属性A_i的香农熵，p_n(A_i)表示属性A_i的所有属性值中null值的比例。1) Priority predicate: For attributes A _i and A _j , if p _score (A _i )<p _score (A _j ), then define a priority predicate Prior(A _i , A _j ), which means the priority of attribute A _i Level p _score (A _i ) is higher than the priority p _score (A _j ) of attribute A _j ; Among them, H(A _i ) represents the Shannon entropy of attribute A _i , and p _n (A _i ) represents the proportion of null values in all attribute values of attribute A _i .

H(A_i)的计算公式为：H(A_i)＝-∑_x∈Xp(x)log₂p(x)，X为属性A_i属性值的值域，p(x)代表属性值x占所有属性值的比重(不包括null值)。The calculation formula of H(A _i ) is: H(A _i )＝-∑ _x∈X p(x)log ₂ p(x), X is the range of the attribute value of attribute A _i , and p(x) represents the attribute value x's weighting of all property values (excluding null values).

2)状态谓词：状态谓词为一阶逻辑谓词，其形式为：2) State predicate: The state predicate is a first-order logical predicate, and its form is:

表示t_i和t_j满足条件P和φ，则t_i的质量高于t_j。It means that t _i and t _j satisfy the conditions P and φ, then the quality of t _i is higher than that of t _j .

上述定义中的条件而f_i(v₁，v₂)可被v₁＝v₂或v₁≠v₂替换。状态谓词定义中的P可以被预先定义的6个谓词替换，分别是P₁(v₁，v₂)、P₂(v₁，v₂)、P₃(v₁，v₂)、P₄(v₁，v₂)、P₅(v₁，v₂)、P₆(v₁，v₂)。P₁(v₁，v₂)、P₂(v₁，v₂)适用于数值类型的属性值，P₁(v₁，v₂)表示v₁比v₂大，P₂(v₁，v₂)表示v₁比v₂小；P₃(v₁，v₂)、P₄(v₁，v₂)适用于字符类型的属性值，P₃(v₁，v₂)表示v₁比v₂长，P₄(v₁，v₂)表示v₁比v₂短；P₅(v₁，v₂)、P₆(v₁，v₂)适用于字符类型的属性值，字符串更详细代表其包含更多信息，其度量方式为用香农熵公式比较两个字符串包含的信息量，P₅(v₁，v₂)表示v₁比v₂更详细，P₆(v₁，v₂)表示v₁比v₂更简略。conditions in the above definition And f _i (v ₁ , v ₂ ) can be replaced by v ₁ =v ₂ or v ₁ ≠v ₂ . P in the state predicate definition can be replaced by 6 pre-defined predicates, namely P ₁ (v ₁ , v ₂ ), P ₂ (v ₁ , v ₂ ), P ₃ (v ₁ , v ₂ ), P ₄ (v ₁ , v ₂ ), P ₅ (v ₁ , v ₂ ), P ₆ (v ₁ , v ₂ ). P ₁ (v ₁ , v ₂ ), P ₂ (v ₁ , v ₂ ) are suitable for attribute values of numeric types, P ₁ (v ₁ , v ₂ ) means that v ₁ is larger than v ₂ , P ₂ (v ₁ , v ₂ ) means that v ₁ is smaller than v ₂ ; P ₃ (v ₁ , v ₂ ), P ₄ (v ₁ , v ₂ ) are suitable for attribute values of character types, and P ₃ (v ₁ , v ₂ ) means that v ₁ Longer than v ₂ , P ₄ (v ₁ , v ₂ ) indicates that v ₁ is shorter than v ₂ ; P ₅ (v ₁ , v ₂ ), P ₆ (v ₁ , v ₂ ) are suitable for character type attribute values, character A more detailed string means that it contains more information, which is measured by comparing the amount of information contained in two strings with the Shannon entropy formula. P ₅ (v ₁ , v ₂ ) means that v ₁ is more detailed than v ₂ , and P ₆ (v ₁ , v ₂ ) means that v ₁ is simpler than v ₂ .

3)交互谓词：交互谓词为一阶逻辑谓词，其形式为Inter_δ(A₁，…，A_l)，表示当一条数据满足条件δ，则该条数据的A₁，…，A_l属性值质量差。3) Interaction predicate: The interaction predicate is a first-order logical predicate, and its form is Inter _δ (A ₁ ,…,A _l ), which means that when a piece of data satisfies the condition δ, then the attribute value of A ₁ ,…,A _l of this piece of data low quality.

上述定义中的其中P_i′可以被P₁～P₆任意谓词替换，同时还可以被以下4个谓词替换：P₇(v₁，v₂)、P₈(v₁，v₂)、P₉(v₁，v₂)、P₁₀(v₁，v₂)；其中，P₇(v₁，v₂)、P₈(v₁，v₂)适用于字符类型的属性值，P₇(v₁，v₂)表示v₁包含v₂，P₈(v₁，v₂)表示v₁不包含v₂；P₉(v₁，v₂)、P₁₀(v₁，v₂)适用于字符类型和数值类型的属性值，P₉(v₁，v₂)表示v₁等于v₂，P₁₀(v₁，v₂)表示v₁不等于v₂。in the above definition Among them, P _i ′ can be replaced by any predicate from P ₁ to P ₆ , and can also be replaced by the following four predicates: P ₇ (v ₁ , v ₂ ), P ₈ (v ₁ , v ₂ ), P ₉ (v ₁ , v ₂ ), P ₁₀ (v ₁ , v ₂ ); among them, P ₇ (v ₁ , v ₂ ), P ₈ (v ₁ , v ₂ ) are suitable for attribute values of character types, and P ₇ (v ₁ , v ₂ ) means that v ₁ contains v ₂ , P ₈ (v ₁ , v ₂ ) means that v ₁ does not contain v ₂ ; P ₉ (v ₁ , v ₂ ), P ₁₀ (v ₁ , v ₂ ) are suitable for character types and attribute values of numerical type, P ₉ (v ₁ , v ₂ ) indicates that v ₁ is equal to v ₂ , and P ₁₀ (v ₁ , v ₂ ) indicates that v ₁ is not equal to v ₂ .

然后根据数据集进行谓词挖掘：对于优先级谓词，可以根据公式计算数据集所有属性的优先度获得；对于状态谓词和交互谓词，根据其一阶逻辑谓词的定义，由一阶归纳学习方法自动获得。在获取状态谓词和交互谓词后，为了进一步提高谓词的可用性，可以请领域专家去掉无效谓词得到最终可用的谓词；Then perform predicate mining according to the data set: for the priority predicate, it can be based on the formula The priority of all attributes of the calculation data set is obtained; for the state predicate and the interaction predicate, according to the definition of its first-order logic predicate, it is automatically obtained by the first-order inductive learning method. After obtaining the state predicates and interaction predicates, in order to further improve the usability of the predicates, domain experts can be asked to remove invalid predicates to obtain the final usable predicates;

b)首先初始化所有属性值的可信度为0，并人工设置参数η(影响因子)为一个实数。然后按以下顺序执行三类谓词推导出数据集中各实体的属性值的可信度；b) Initialize the reliability of all attribute values as 0, and manually set the parameter η (influence factor) to a real number. Then execute the three types of predicates in the following order to derive the credibility of the attribute values of each entity in the dataset;

1)运用状态谓词：两两枚举数据集中两条数据，如果这两条数据满足某个状态谓词则将的可信度减去η。1) Use state predicates: Enumerate two pieces of data in the data set in pairs, if these two pieces of data satisfy a certain state predicate then will The reliability minus η.

2)运用交互谓词：遍历数据集中的所有数据，如果一条数据满足某个交互谓词Inter_δ(A₁，…，A_l)，则将该条数据A₁，…，A_l属性值可信度减去η。2) Use the interaction predicate: traverse all the data in the data set, if a piece of data satisfies a certain interaction predicate Inter _δ (A ₁ ,…,A _l ), then the attribute value credibility of the piece of data A ₁ ,…,A _l Subtract η.

3)运用优先级谓词：由于属性的优先级是由决定的，为了使当前属性值的可信度是最新的，需先执行优先级高的优先级谓词，即所含两个属性的p_score之和较小的优先级谓词。优先级谓词Prior(A_i，A_j)的功能是，如果两条数据t₁、t₂满足条件此时可通过属性A_i的可信度来判断和哪个好。方法为，对于所有在A_j上可信度相同的多个数据，将它们按照A_i的可信度作升序排序，按照排序后的顺序，在排在第n位的数据的A_j的属性值可信度上加上n-1，这样就根据优先级较高的A_i区分出了可信度相同的A_j的值。同时注意对于需要返回多个结果的属性，如果可信度为负数，则无需运用优先级谓词。3) Use priority predicates: Since the priority of attributes is determined by It is determined that in order to make the credibility of the current attribute value the latest, the priority predicate with higher priority must be executed first, that is, the priority predicate with the smaller sum of the p _score of the two attributes included. The function of the priority predicate Prior(A _i , A _j ) is, if two pieces of data t ₁ and t ₂ satisfy the condition At this time, it can be judged by the credibility of attribute A _i and Which is better. The method is, for all multiple data with the same reliability on A _j , sort them in ascending order according to the reliability of A _i , and according to the sorted order, the attribute of A _j in the nth-ranked data Add n-1 to the value credibility, so that the value of A _j with the same credibility is distinguished according to the higher priority A _i . Also note that for attributes that need to return multiple results, if the confidence is negative, there is no need to apply the priority predicate.

得到所有属性值的可信度后，对于需要返回多个结果的属性，返回所有可信度大于等于预设阈值的属性值作为结果，对于需要返回一个结果的属性，继续以下步骤。After obtaining the credibility of all attribute values, for attributes that need to return multiple results, return all attribute values with credibility greater than or equal to the preset threshold as results, and for attributes that need to return one result, continue with the following steps.

c)根据计算数据源的可信度，并对所有数据源的可信度进行归一化，即∑_iλ_i＝1，如图2所示。再用更新每个属性值的可信度，每个属性值的可信度等于提供该属性值的数据源的可信度之和乘上自己原有的可信度，注意null值的数据源可信度就是自身数据源的可信度，不包括其它提供null值的数据源的可信度。接着，同样对所有属性值的可信度进行归一化，使得对于任意一个属性，其所有可能取到的值的可信度和为1。重复上述步骤直到数据源的可信度和每个属性值的可信度收敛。c) According to Calculate the credibility of the data source, and normalize the credibility of all data sources, that is, ∑ _i λ _i =1, as shown in Figure 2. reuse Update the credibility of each attribute value. The credibility of each attribute value is equal to the sum of the credibility of the data source that provides the attribute value multiplied by its original credibility. Note that the data source with a null value is credible Degree refers to the credibility of its own data source, excluding the credibility of other data sources that provide null values. Next, the credibility of all attribute values is also normalized, so that for any attribute, the sum of the credibility of all possible values is 1. Repeat the above steps until the credibility of the data source and the credibility of each attribute value converge.

d)最后，对于只需返回一个结果的属性，找出该属性下可信度最高的属性值为最终结果，结合b)中的结果作为最终结果。d) Finally, for an attribute that only needs to return one result, find out the attribute value with the highest reliability under this attribute as the final result, and combine the result in b) as the final result.

下面将结合具体样例，说明本发明的实施方式：Below in conjunction with specific examples, the embodiment of the present invention is described:

我们令一个数据源D_i的可信度为λ_i，该数据源属性集合为{A₁，…，A_n}，而t∈D_i为该数据源的一条数据，其中代表t对应A_j的属性值。再令d(t)表示该条数据的可信度，表示属性值的可信度。一条数据的可信度等于该条数据所有属性值可信度之和，即：We let the credibility of a data source D _i be λ _i , the attribute set of the data source is {A ₁ ,...,A _n }, and t∈D _i is a piece of data of the data source, where Indicates that t corresponds to the attribute value of A _j . Then let d(t) represent the credibility of the piece of data, Indicates the attribute value credibility. The credibility of a piece of data is equal to the sum of the credibility of all attribute values of the piece of data, namely:

而一个数据源的可信度等于其包含的所有数据的可信度的平均值，即所有数据的可信度之和除以数据的数目：The credibility of a data source is equal to the average of the credibility of all the data it contains, that is, the sum of the credibility of all data divided by the number of data:

同时，令D′为对于属性A_j提供属性值的数据源，则属性值的可信度为所有提供该值的数据源的可信度之和乘上自己原有的可信度：At the same time, let D′ be the property value that provides the property value for property A _j data source, the property value The credibility of is the sum of the credibility of all data sources that provide the value multiplied by its original credibility:

实施例：清洗数据集如下表所示，一共有5条数据和5个数据源，其中t_i来自数据源D_i，描述了一个名叫Mary的科研人员数据。Example: The cleaning data set is shown in the following table. There are 5 pieces of data and 5 data sources in total, among which t _i comes from the data source D _i and describes the data of a scientific researcher named Mary.

清洗数据集表Clean the dataset table

首先，先简单观察一下数据集，做一个人工预处理，清除一些明显不合理的数据，使得后续的数据清洗效率更高，效果更好。First of all, simply observe the data set first, do a manual preprocessing, and clear some obviously unreasonable data, so that the subsequent data cleaning efficiency is higher and the effect is better.

比如上面数据集中的t₅这一条数据，salary为负值，明显不合理，并且这条数据中后面的Research Area，Affiliation和Publication这三个属性的值也没有太大意义，因此可以将t₅这条数据视作噪声直接删除，而不参与后续的清洗操作。For example, in the data of t ₅ in the above data set, the salary is a negative value, which is obviously unreasonable, and the values of the three attributes of Research Area, Affiliation and Publication in this data are not very meaningful, so t ₅ can be This piece of data is directly deleted as noise without participating in subsequent cleaning operations.

再看t₄这条数据，它的publication这个属性的值为“-”，这个也属于不合理的数据，但因为t₄这条数据其他属性的值仍有参考价值，因此可以直接将publication的值改为“null”。Looking at the data of t ₄ again, the value of its publication attribute is "-", which is also unreasonable data, but because the values of other attributes of the data of t ₄ still have reference value, you can directly use the publication Value changed to "null".

经过上述简单的预处理之后，得到的数据集如下：After the above simple preprocessing, the obtained data set is as follows:

第一步，对数据集进行谓词挖掘。In the first step, predicate mining is performed on the dataset.

挖掘优先级谓词Mining Priority Predicates

对于优先级谓词，统计所有属性的熵和null值的比例。For priority predicates, count the entropy of all attributes and the proportion of null values.

熵计算公式：H(A_i)＝-∑_x∈Xp(x)log₂p(x)Entropy calculation formula: H(A _i )＝-∑ _x∈X p(x)log ₂ p(x)

其中，p(x)代表属性值x占所有属性值的比重(不包括null值)；Among them, p(x) represents the proportion of attribute value x to all attribute values (excluding null values);

以salary为例，共有3个属性值，分别是142k，120k和88k。Taking salary as an example, there are three attribute values, namely 142k, 120k and 88k.

其中，则可得属性salary的熵为：in, Then the entropy of attribute salary can be obtained as:

同理：In the same way:

p_n(Salary)＝p_n(ResearchArea)＝0p _n (Salary) = p _n (ResearchArea) = 0

可以得到：can get:

根据上述小于关系可定义三个优先级谓词：According to the above less than relationship, three priority predicates can be defined:

挖掘状态谓词：Mining state predicates:

由一阶逻辑谓词挖掘算法First Order Inductive Learner自动获得：Automatically obtained by the first-order logic predicate mining algorithm First Order Inductive Learner:

挖掘交互谓词：Mining interaction predicates:

同样由一阶逻辑谓词挖掘算法First Order Inductive Learner自动获得：Also automatically obtained by the first-order logical predicate mining algorithm First Order Inductive Learner:

第二步，推导各实体的属性值的可信度。The second step is to derive the credibility of the attribute values of each entity.

初始化所有属性值的可信度为0，设置影响因子η＝1，对于需要返回多个结果的属性，设可信度置阈值为0。然后按一定的顺序使用相应的谓词(不同的谓词执行顺序可能会产生不同的结果)。Initialize the reliability of all attribute values to 0, set the impact factor η=1, and set the reliability threshold to 0 for attributes that need to return multiple results. Then use the corresponding predicates in a certain order (different predicate execution orders may produce different results).

状态谓词和交互谓词都是作用于属性值，而属性值不会改变，因此状态谓词与状态谓词之间、交互谓词与交互谓词之间，以及状态谓词和交互谓词之间均相互独立，可以以任意顺序调用。Both the state predicate and the interaction predicate act on the attribute value, and the attribute value will not change, so the relationship between the state predicate and the state predicate, between the interaction predicate and the interaction predicate, and between the state predicate and the interaction predicate are independent of each other. Called in any order.

然而，优先级谓词作用在属性值的可信度之上，因此优先级谓词必须在所有状态谓词和交互谓词之后使用。However, priority predicates act on the trustworthiness of attribute values, so priority predicates must be used after all state and interaction predicates.

除此之外，优先级谓词和优先级谓词之间也必须遵守一定的顺序。为了使当前属性值的可信度是最新的，需先执行优先级高的优先级谓词，即所含两个属性的优先级之和较小的优先级谓词。对于清洗数据集表的优先级谓词，的p_score之和为3.5，的p_score之和也为3.5，的p_score之和为4.11，所以应按照的顺序执行优先级谓词。In addition, a certain order must be observed between priority predicates and priority predicates. In order to make the credibility of the current attribute value up-to-date, the priority predicate with higher priority needs to be executed first, that is, the priority predicate with the smaller sum of the priorities of the two attributes included. For priority predicates on cleaning dataset tables, The sum of the p _scores is 3.5, The sum of the p _scores is also 3.5, The sum of the p _scores is 4.11, so it should be according to The order in which the priority predicates are executed.

在执行状态谓词后，所有属性值的可信度如表1所示。这里有4条数据，需两两进行比较，一共进行了16次比较。以t₁和t₂为例，根据状态谓词因为所以将减1。After executing the state predicate, the credibility of all attribute values is shown in Table 1. There are 4 pieces of data here, which need to be compared in pairs, and a total of 16 comparisons have been made. Taking t ₁ and t ₂ as examples, according to the state predicate because So will minus 1.

表1Table 1

SalarySalary Research AreaResearch Area AffiliationAffiliation PublicationPublication t₁(D₁)t ₁ (D ₁ ) 00 00 00 00 t₂(D₂)t ₂ (D ₂ ) -1-1 00 00 00 t₃(D₃)t ₃ (D ₃ ) -2-2 00 00 00 t₄(D₄)t ₄ (D ₄ ) -2-2 00 00 00

在执行交互谓词后，所有属性值的可信度如表2所示。此处根据交互谓词和将所有Affiliation和Publication值为null的属性值可信度减1。After executing the interaction predicate, the credibility of all attribute values is shown in Table 2. Here according to the interaction predicate and Decrease the credibility of all attribute values with null Affiliation and Publication values by 1.

表2Table 2

SalarySalary Research AreaResearch Area AffiliationAffiliation PublicationPublication t₁(D₁)t ₁ (D ₁ ) 00 00 00 00 t₂(D₂)t ₂ (D ₂ ) -1-1 00 00 -1-1 t₃(D₃)t ₃ (D ₃ ) -2-2 00 00 00 t₄(D₄)t ₄ (D ₄ ) -2-2 00 -1-1 -1-1

在执行优先级谓词后，所有属性值的可信度如表3所示。以Research Area为例，初始Research Area列的值为{0，0，0，0]，根据优先级谓词可以根据Salary列的{0，-1，-2，-2)来更新Research Area列的值。将Research Area列中可信度相同的值根据Salary列的升序重新排序，排序后分别加上0，1，2，还原顺序后得到{2，1，0，0]。同理再执行优先级谓词和 After executing the priority predicate, the credibility of all attribute values is shown in Table 3. Taking Research Area as an example, the initial value of the Research Area column is {0, 0, 0, 0], according to the priority predicate The value of the Research Area column can be updated according to {0, -1, -2, -2) of the Salary column. Reorder the values with the same credibility in the Research Area column according to the ascending order of the Salary column, add 0, 1, and 2 respectively after sorting, and get {2, 1, 0, 0] after restoring the order. In the same way, execute the priority predicate again and

表3table 3

SalarySalary Research AreaResearch Area AffiliationAffiliation PublicationPublication t₁(D₁)t ₁ (D ₁ ) 00 22 22 11 t₂(D₂)t ₂ (D ₂ ) -1-1 11 11 -1-1 t₃(D₃)t ₃ (D ₃ ) -2-2 00 00 00 t₄(D₄)t ₄ (D ₄ ) -2-2 00 -1-1 -1-1

接着对所有属性进行标记，一个科研人员同一时间只能有一个工资和一个隶属机构，所以Salary和Affiliation都只返回一个结果，但是一个科研人员的研究领域和著作却可以有多个，因此ResearchArea和Publication属性应该返回多个结果值。对于多值属性，此时返回大于等于阈值0的所有属性值，即对于Research Area，返回结果为{Dataintegration，data cleaning Data cleaning&Google Knowledge managementInformation retrieve}，对于Publication，返回结果为{Data integration，Adiagnostic tool for data errors}。Then mark all the attributes. A researcher can only have one salary and one affiliation at the same time, so Salary and Affiliation only return one result, but a researcher can have multiple research fields and works, so ResearchArea and The Publication property should return multiple result values. For multi-valued attributes, all attribute values greater than or equal to the threshold 0 are returned at this time, that is, for Research Area, the returned result is {Data integration, data cleaning Data cleaning&Google Knowledge managementInformation retrieve}; for Publication, the returned result is {Data integration, Adiagnostic tool for data errors}.

第三步，计算数据源可信度。The third step is to calculate the credibility of the data source.

接着通过把所有属性值的可信度值映射到(0，1)，并且归一化，结果如表4所示。并且根据计算所有数据源的可信度。then pass Map the credibility values of all attribute values to (0, 1), and normalize, the results are shown in Table 4. and according to Calculate the credibility of all data sources.

表4Table 4

SalarySalary Research AreaResearch Area AffiliationAffiliation PublicationPublication λlambda t₁(D₁)t ₁ (D ₁ ) 0.4963530.496353 0.337230.33723 0.3699590.369959 0.4132750.413275 0.4042040.404204 t₂(D₂)t ₂ (D ₂ ) 0.266980.26698 0.27990.2799 0.3070650.307065 0.1520350.152035 0.2514950.251495 t₃(D₃)t ₃ (D ₃ ) 0.1183330.118333 0.1914350.191435 0.2100140.210014 0.2826550.282655 0.2006090.200609 t₄(D₄)t ₄ (D ₄ ) 0.1183330.118333 0.1914350.191435 0.1129630.112963 0.1520350.152035 0.1436920.143692

最后通过和两个式子迭代的更新所有属性值的可信度直至所有属性值的可信度收敛，注意每次按列更新属性值的可信度后需要将该列可信度归一化。以第一次更新过程为例，对于Salary列的属性值可信度：finally passed and The two formulas update the credibility of all attribute values iteratively until the credibility of all attribute values converges. Note that after updating the credibility of attribute values by column each time, the credibility of the column needs to be normalized. Taking the first update process as an example, for the attribute value credibility of the Salary column:

{0.496353，0.26698，0.118333，0.118333]{0.496353, 0.26698, 0.118333, 0.118333]

→{0.496353×0.404204，0.26698×0.251495，0.118333×0.344301，0.118333×0.344301]→{0.496353×0.404204, 0.26698×0.251495, 0.118333×0.344301, 0.118333×0.344301]

→{0.200628，0.0671441，0.0407422，0.0407422)→{0.200628, 0.0671441, 0.0407422, 0.0407422)

同理，对于Research Area列的属性值可信度：Similarly, for the attribute value credibility of the Research Area column:

{0.33723，0.2799，0.191435，0.191435)→{0.500009，0.258216，0.140871，0.100903]{0.33723, 0.2799, 0.191435, 0.191435) → {0.500009, 0.258216, 0.140871, 0.100903]

对于Affiliation列的属性值可信度：For the attribute value credibility of the Affiliation column:

{0.369959，0.307065，0.210014，0.112963)→{0.404204，0.251495，0.200609，0.143692){0.369959, 0.307065, 0.210014, 0.112963) → {0.404204, 0.251495, 0.200609, 0.143692)

对于Publication列的属性值可信度：For the attribute value credibility of the Publication column:

{0.413275，0.152035，0.282655，0.152035}→{0.588542，0.134713，0.199777，0.0769686]{0.413275, 0.152035, 0.282655, 0.152035} → {0.588542, 0.134713, 0.199777, 0.0769686]

最后更新数据源的可信度：Finally update the credibility of the data source:

λ＝{0.5168，0.209168，0.164478，0.109554)λ = {0.5168, 0.209168, 0.164478, 0.109554)

重复上述过程直至收敛，最终结果如表5所示。Repeat the above process until convergence, and the final results are shown in Table 5.

表5table 5

SalarySalary Research AreaResearch Area AffiliationAffiliation PublicationPublication λlambda t₁(D₁)t ₁ (D ₁ ) 11 11 11 11 11 t₂(D₂)t ₂ (D ₂ ) 00 00 00 00 00 t₃(D₃)t ₃ (D ₃ ) 00 00 00 00 00 t₄(D₄)t ₄ (D ₄ ) 00 00 00 00 00

第四步，得出结果。The fourth step is to get the result.

根据表5，可选出Salary和Affiliation属性中最好的属性值，即可信度最大的属性值为结果。其中Salary的结果为{142k}，Affiliation的结果为{Amazon}。According to Table 5, the best attribute value among the Salary and Affiliation attributes can be selected, that is, the attribute value with the highest reliability is the result. The result of Salary is {142k}, and the result of Affiliation is {Amazon}.

以上所述仅是本发明的优选实施方式，应当指出：对于本技术领域的普通技术人员来说，在不脱离本发明原理的前提下，还可以做出若干改进和润饰，这些改进和润饰也应视为本发明的保护范围。The above is only a preferred embodiment of the present invention, it should be pointed out that for those of ordinary skill in the art, without departing from the principle of the present invention, some improvements and modifications can also be made, and these improvements and modifications are also possible. It should be regarded as the protection scope of the present invention.

Claims

1. A predicate-based multi-source data set cleaning method, characterized in that, comprising steps:

(1) Build a predicate model: define priority predicates, status predicates and interaction predicates; among them,

The priority predicate is Prior(A _i , A _j ), which means that the priority of attribute A _i is higher than that of attribute A _j ;

The status predicates are: Among them, t _i represents statement i, Indicates the attribute value of attribute A _k in sentence i, means predefined and φ(t _i , t _j ) represents the condition satisfied between the predefined t _i and t _j ; Stat(A _k ) represents when t _i and t _j satisfy the conditions P and φ, t _i has a mass higher than t _j ;

The interaction predicate is: Inter _δ (A ₁ ,...,A _l ), which means that when the data satisfies the condition δ, the quality of the attribute value of the attribute A ₁ ,...,A _l of the piece of data is poor;

(2) Perform predicate mining on the data set to be cleaned through the predicate model defined in step (1), and obtain priority predicates, status predicates and interaction predicates in the data set;

(3) Deduce the attribute value credibility of each data in the data set according to the obtained predicate, including steps:

(3-1) All attribute value credibility of the data in the initialization data set is 0, and influence factor η is set for each attribute value of each piece of data, and η is a constant;

(3-2) Use state predicates and interaction predicates to update the credibility of each attribute value of each piece of data. When updating, first use the state predicates to update and then use the interaction predicates to update, or first use the interaction predicates to update and then use the state predicates to update;

The steps to update the credibility of each attribute value of the data using the state predicate are as follows: enumerate the two data t _i and t _j in the data set in pairs, if t _i and t _j satisfy the state predicate on the attribute A _k : then attribute value The credibility minus η;

The steps to use the interaction predicate to update the credibility of each attribute value of the data are: traverse all the data in the data set, if a piece of data satisfies a certain interaction predicate Inter _δ (A ₁ ,...,A _l ), then the data attribute A ₁ ,... _, the credibility of the attribute value of Al minus η;

(3-3) After the step (2) is completed, use the priority predicate to update the attribute value credibility of each piece of data. When updating, execute the priority predicate in order of priority from high to low;

The steps of executing the priority predicate Prior(A _i , A _j ) are: if the attribute values of multiple pieces of data on attribute A _j have the same credibility, sort them in ascending order according to the attribute value credibility of A _i , According to the sorted order, add n-1 to the attribute value reliability of A _j of the nth-ranked data;

(3-4) After obtaining the credibility of all attribute values, for multi-valued attributes, return all attribute values whose credibility is greater than or equal to the preset threshold as the result; for attributes that only need to return one result, perform step (4) to (6);

(4) Normalize the credibility of all attribute values; according to the formula Calculate the credibility of all data sources in the data set to be cleaned; where, λ _i represents the credibility of data source D _i , t represents a piece of data in data source D _i , d(t) represents the credibility of data t, The credibility of data t is equal to the sum of the credibility of all attribute values of the data;

(5) According to the formula Update the credibility of each attribute value, D' means that attribute value is provided for attribute A _j The data source; after updating, return to step (4);

(6) Repeat steps (4) to (5) until the credibility of all attribute values converges; for an attribute that only needs to return one result, find out the attribute value with the highest reliability under this attribute as the final result.

2. A predicate-based multi-source data set cleaning method according to claim 1, wherein the method for defining the priority predicate is: for attributes A _i and A _j , if p _score (A _i )<p _score (A _j ), then define a priority predicate Prior(A _i , A _j ), which means that the priority p _score (A _i ) of attribute A _i is higher than the priority p _score (A _j ) of attribute A _j ; Among them, H(A _i ) represents the Shannon entropy of attribute A _i , and p _n (A _i ) represents the proportion of null values in all attribute values of attribute A _i .

3. A predicate-based multi-source data set cleaning method according to claim 2, characterized in that, both the state predicate and the interaction predicate are obtained through a first-order logic predicate mining method.

4. A predicate-based multi-source data set cleaning method according to claim 3, characterized in that, before the data set is cleaned, all attributes of all data sets are manually marked, and each attribute needs to be marked Return one result or multiple results. If an attribute only needs to return one result, mark the attribute as a single-valued attribute. When cleaning, the attribute with the highest reliability under the attribute is the final result; if there may be multiple As a result, the attribute is marked as a multi-valued attribute, and all attributes under the attribute whose reliability is greater than a preset threshold are valued as final results during cleaning.