CN107145523B - Alignment method for large heterogeneous knowledge bases based on iterative matching

Info

Publication number: CN107145523B
Application number: CN201710237034.6A
Authority: CN
Inventors: 陈岭; 顾伟东
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2017-04-12
Filing date: 2017-04-12
Publication date: 2019-10-18
Anticipated expiration: 2037-04-12
Also published as: CN107145523A

Abstract

The invention discloses a method for aligning large heterogeneous knowledge bases based on iterative matching. The method is implemented as follows: 1) screen the data in the original knowledge bases, unify the data format, and on this basis obtain the relations in the knowledge bases and the initial matching entity pairs; 2) partition the preprocessed knowledge bases using the relations in the knowledge bases, and prune the resulting blocks; 3) match blocks using the already matched entity pairs to obtain matching block pairs; 4) select candidate entity pairs from the matching block pairs, and confirm them using a similarity measure and a threshold; 5) repeat the above steps until no new candidate entity pairs can be found, which yields all matching entity pairs. The invention applies the idea of iterative matching to the alignment of heterogeneous knowledge bases, and has broad application prospects in fields such as knowledge base alignment, data fusion, and automatic question answering.

Description

Alignment method for large heterogeneous knowledge bases based on iterative matching

Technical field

The invention relates to the field of knowledge base alignment, and in particular to a method for aligning large heterogeneous knowledge bases based on iterative matching.

Background

With the advent of Web 3.0, structured knowledge bases appear more and more frequently on the Internet. These knowledge bases are widely used in semantic applications such as automatic question answering, search services, and social services. However, the information in any single knowledge base is limited, which restricts what these applications can do. This leaves considerable room for knowledge base alignment. Knowledge base alignment usually refers to entity alignment across knowledge bases, that is, automatically discovering two entities that represent the same real-world thing and linking them.

As knowledge bases keep growing, alignment methods usually divide the process into two steps: discovering candidate entity pairs and confirming candidate entity pairs. Discovery uses a small number of attributes to quickly select a few candidate entities for each entity; confirmation compares the two entities of a pair comprehensively and decides whether they match using a similarity score and a threshold. Because this avoids an exact comparison of every pair of entities, it greatly improves overall efficiency. At present, the bottleneck of knowledge base alignment methods is that candidate discovery often misses pairs, so entity pairs that could have been matched are never found.

To improve the quality of candidate entity pairs, researchers have proposed iterative matching: each round discovers a small number of matching entity pairs, which then serve as the basis for discovering candidate entity pairs in the next round. However, traditional alignment methods usually focus on homogeneous knowledge bases, that is, pairs of knowledge bases with many alignable relations. Their basic assumption is that if two entities match and have an aligned relation, their "compatible neighbors" are likely to match as well, so the "compatible neighbors" are taken as candidate entity pairs. When the knowledge bases share few alignable relations, these methods miss some candidate entity pairs. To address this, researchers have proposed class-based alignment methods, which group instances with the same characteristics into a class and exclude candidate entities unrelated to the content of the class in order to confirm candidate entity pairs. However, because such a method obtains candidate entity pairs only once, through classical partitioning techniques at the initial stage of the model, it also misses many candidate entity pairs when few attributes are aligned between the two knowledge bases.

Summary of the invention

In view of the above, the invention proposes a method for aligning large heterogeneous knowledge bases based on iterative matching. The method applies iterative matching to knowledge base alignment: an iterative framework traverses the relations and uses them to partition the knowledge bases, which enlarges the search space for candidate entity pairs. At the same time, a divide-and-conquer strategy is used to select and confirm candidate entity pairs, so that each entity needs a comprehensive comparison with only a few candidate entities, which improves efficiency.

A method for aligning large heterogeneous knowledge bases based on iterative matching comprises:

Data preprocessing stage: screen the data in any two original knowledge bases KB₁ and KB₂, unify the data format, and remove meaningless characters; obtain by statistics the relation set R₁ corresponding to the processed knowledge base KB′₁ and the relation set R₂ corresponding to the processed knowledge base KB′₂; and obtain the initial matching entity pair set by comparison.

Knowledge base alignment stage: partition knowledge bases KB′₁ and KB′₂ using the relations in relation sets R₁ and R₂, and prune the blocks to obtain the pruned block sets B′₁ and B′₂; then match the blocks in B′₁ and B′₂ using the initial matching entity pair set to obtain matching block pairs; finally, select candidate entity pairs from the matching block pairs, and confirm them using a similarity measure and the threshold δ_e.

The specific steps of the data preprocessing stage are:

(1-1) Input any two original knowledge bases KB₁ and KB₂, and remove information irrelevant to the alignment task from them;

(1-2) Unify the data format of the literals L₁ in knowledge base KB₁ and the literals L₂ in knowledge base KB₂, so that dates, numbers, and names are expressed in a uniform format;

(1-3) Remove stop-word characters, symbol characters, and language-tag characters from the literals L₁ in knowledge base KB₁ and the literals L₂ in knowledge base KB₂ to obtain the processed knowledge bases KB′₁ and KB′₂;

(1-4) Obtain by statistics the relation set R₁ corresponding to knowledge base KB′₁ and the relation set R₂ corresponding to knowledge base KB′₂;

(1-5) Compare all entities in knowledge base KB′₁ with all entities in knowledge base KB′₂ to obtain the initial matching entity pair set.

A knowledge base is defined as a six-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R is the set of entity-relation-entity triples, representing relation facts whose object is an entity; and F_P is the set of entity-attribute-literal triples, representing attribute facts whose object is a literal. Both F_R and F_P may contain meaningless information; for example, some knowledge bases include the original text corpus from which the triples were extracted, and such information reduces the efficiency of the algorithm. In addition, some triples containing the "sameAs" relation should also be removed.

The specific process of step (1-4) is:

For knowledge base KB′₁, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R₁; for knowledge base KB′₂, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R₂. Relation sets R₁ and R₂ are used in the subsequent knowledge base partitioning.
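Step (1-4) amounts to a single pass over the relation facts. The following is a minimal sketch, assuming triples are held in memory as `(subject, relation, object)` tuples; the function name `collect_relations` and the toy data are illustrative and not part of the original text.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject entity, relation, object entity)


def collect_relations(relation_facts: Iterable[Triple]) -> Set[str]:
    """Collect the distinct relations that occur in a set of relation facts."""
    return {relation for _subject, relation, _object in relation_facts}


# Toy relation facts standing in for F_R1 (hypothetical data).
f_r1 = [
    ("Inception", "directedBy", "Christopher_Nolan"),
    ("Interstellar", "directedBy", "Christopher_Nolan"),
    ("Inception", "starring", "Leonardo_DiCaprio"),
]
r1 = collect_relations(f_r1)
print(sorted(r1))
```

Applying the same function to the second knowledge base's relation facts would give R₂.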

In step (1-5), the initial matching entity pair set is obtained as follows:

First, extract all entities in knowledge base KB′₁ to form entity set E₁, and all entities in knowledge base KB′₂ to form entity set E₂; take the Cartesian product of E₁ and E₂, so that every entity in E₁ is paired with every entity in E₂, to form the entity pair set;

Then, select from the entity pair set those pairs whose two entities have exactly the same string for the name attribute, to obtain the pre-initial matching entity pair set;

Finally, select from the pre-initial matching entity pair set the entity pairs that have a one-to-one matching relationship, to form the initial matching entity pair set.
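The seed-matching procedure of step (1-5) can be sketched as below. This is an illustration under stated assumptions: each entity has exactly one (already normalized) name, and instead of materializing the full Cartesian product the sketch indexes entities by name, which yields the same pairs under that assumption. The name `initial_matches` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def initial_matches(names1: Dict[str, str],
                    names2: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Pair entities with identical name strings, keeping one-to-one pairs only."""
    by_name1: Dict[str, List[str]] = defaultdict(list)
    by_name2: Dict[str, List[str]] = defaultdict(list)
    for entity, name in names1.items():
        by_name1[name].append(entity)
    for entity, name in names2.items():
        by_name2[name].append(entity)

    matches = set()
    for name, entities1 in by_name1.items():
        entities2 = by_name2.get(name, [])
        # A name shared by exactly one entity on each side gives a
        # one-to-one pair; ambiguous names are skipped.
        if len(entities1) == 1 and len(entities2) == 1:
            matches.add((entities1[0], entities2[0]))
    return matches


names_kb1 = {"a1": "inception", "a2": "heat", "a3": "heat"}
names_kb2 = {"b1": "inception", "b2": "heat", "b3": "up"}
seeds = initial_matches(names_kb1, names_kb2)
print(seeds)
```

In the sample, "heat" is ambiguous in the first knowledge base, so only the "inception" pair survives the one-to-one filter.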

The specific steps of the knowledge base alignment stage are:

(2-1) Input knowledge base KB′₁, knowledge base KB′₂, relation set R₁, relation set R₂, and the initial matching entity pair set; set the block similarity threshold δ_b, the entity similarity threshold δ_e, the threshold δ₁ on the number of entities in a block, and the threshold δ₂ on the ratio of matched entities in a block; initialize the matching entity pair set M_e to the initial matching entity pair set;

(2-2) Randomly select any relation from relation set R₁ or relation set R₂, and use it to divide the entities in knowledge bases KB′₁ and KB′₂ into blocks, obtaining block set B₁ for knowledge base KB′₁ and block set B₂ for knowledge base KB′₂;

(2-3) Remove from block sets B₁ and B₂ the blocks that are computationally expensive or unlikely to yield matching entity pairs, obtaining the pruned block sets B′₁ and B′₂;

(2-4) Using all matching entity pairs in M_e, measure the similarity between every block in B′₁ and every block in B′₂, and match each two blocks whose similarity exceeds the block similarity threshold δ_b, obtaining the matching block pair set;

(2-5) For each matching block pair in the matching block pair set, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set;

(2-6) Determine whether no new candidate entity pair has been found; if new candidates were found, go to step (2-7); otherwise, end the iteration and output the matching entity pair set M_e;

(2-7) Compute the similarity between the two entities of each candidate entity pair, add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the rest;

(2-8) Determine whether the number of iterations has reached the iteration threshold; if not, return to step (2-2); if so, end the iteration and output the matching entity pair set M_e.
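Steps (2-1) to (2-8) can be read as the loop sketched below. Several parts are assumptions made for illustration, not the patent's definitions: block similarity is taken to be a Jaccard-style set similarity, `difflib.SequenceMatcher` stands in for the string similarity, the loop is read as stopping once the iteration count reaches a cap `max_iter`, and all function names, default values, and toy data are hypothetical.

```python
import random
from collections import defaultdict
from difflib import SequenceMatcher


def partition(facts, relation):
    """(2-2) Group subject entities by the object they share under `relation`."""
    blocks = defaultdict(set)
    for subject, rel, obj in facts:
        if rel == relation:
            blocks[obj].add(subject)
    return list(blocks.values())


def prune(blocks, matched, max_size, min_ratio):
    """(2-3) Drop oversized, poorly matched, and fully matched blocks."""
    kept = []
    for block in blocks:
        hits = len(block & matched)
        if len(block) <= max_size and hits / len(block) >= min_ratio and hits < len(block):
            kept.append(block)
    return kept


def block_sim(b1, b2, matches):
    """(2-4) Assumed Jaccard-style similarity; matched pairs count as shared."""
    shared = sum(1 for e1, e2 in matches if e1 in b1 and e2 in b2)
    union = len(b1) + len(b2) - shared
    return shared / union if union > 0 else 0.0


def align(facts1, facts2, names1, names2, seeds, rels1, rels2, d_b=0.2,
          d_e=0.65, d_1=50, d_2=0.3, alpha=0.6, max_iter=20, seed=0):
    rng = random.Random(seed)
    matches = set(seeds)                                    # (2-1)
    relations = sorted(rels1 | rels2)
    for _round in range(max_iter):                          # (2-8)
        rel = rng.choice(relations)                         # (2-2)
        done1 = {e1 for e1, _e2 in matches}
        done2 = {e2 for _e1, e2 in matches}
        blocks1 = prune(partition(facts1, rel), done1, d_1, d_2)
        blocks2 = prune(partition(facts2, rel), done2, d_1, d_2)
        candidates = {}
        for b1 in blocks1:
            for b2 in blocks2:
                s_block = block_sim(b1, b2, matches)
                if s_block > d_b:                           # matching block pair
                    for e1 in b1 - done1:                   # (2-5)
                        for e2 in b2 - done2:
                            best = candidates.get((e1, e2), 0.0)
                            candidates[(e1, e2)] = max(best, s_block)
        if not candidates:                                  # (2-6)
            break
        for (e1, e2), s_block in candidates.items():        # (2-7)
            s_str = SequenceMatcher(None, names1[e1], names2[e2]).ratio()
            if alpha * s_str + (1 - alpha) * s_block > d_e:
                matches.add((e1, e2))
    return matches


facts_kb1 = [("m1", "directedBy", "d1"), ("m2", "directedBy", "d1"),
             ("m3", "directedBy", "d2")]
facts_kb2 = [("n1", "directedBy", "x1"), ("n2", "directedBy", "x1"),
             ("n3", "directedBy", "x2")]
names_kb1 = {"m1": "inception", "m2": "interstellar", "m3": "dunkirk"}
names_kb2 = {"n1": "inception", "n2": "interstelar", "n3": "tenet"}
result = align(facts_kb1, facts_kb2, names_kb1, names_kb2,
               {("m1", "n1")}, {"directedBy"}, {"directedBy"})
print(sorted(result))
```

The sketch does not enforce one-to-one matching among newly confirmed pairs, and it stops as soon as a round yields no candidates, exactly as step (2-6) is worded; a practical implementation might try other relations before giving up.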

In step (2-2), the entities in a knowledge base are divided into blocks using the relation as follows:

First, for the triple set F_R1 of knowledge base KB′₁, count the n distinct object entities in F_R1;

Then, for each object entity, group all subject entities that correspond to it in F_R1 into one block; the n object entities yield n blocks, which form block set B₁;

Block set B₂ is obtained in the same way.
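The partitioning can be sketched as a group-by on the object entity. The sketch assumes that only triples carrying the relation selected in step (2-2) are considered, and that triples are `(subject, relation, object)` tuples; `partition_by_relation` and the sample facts are illustrative names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject entity, relation, object entity)


def partition_by_relation(relation_facts: Iterable[Triple],
                          relation: str) -> Dict[str, Set[str]]:
    """Build one block per object entity, holding the subjects linked to it."""
    blocks: Dict[str, Set[str]] = defaultdict(set)
    for subject, rel, obj in relation_facts:
        if rel == relation:
            blocks[obj].add(subject)
    return dict(blocks)


facts = [
    ("Inception", "directedBy", "Christopher_Nolan"),
    ("Interstellar", "directedBy", "Christopher_Nolan"),
    ("Heat", "directedBy", "Michael_Mann"),
    ("Inception", "starring", "Leonardo_DiCaprio"),
]
b1 = partition_by_relation(facts, "directedBy")
print(b1)
```

Keying each block by its object entity is a convenience of the sketch; the method itself only needs the blocks as sets of entities.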

In step (2-3), the blocks that are computationally expensive or unlikely to yield matching entity pairs include: blocks whose number of entities exceeds the threshold δ₁, blocks whose ratio of matched entities is below the threshold δ₂, and blocks whose entities have all been matched.
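The three pruning rules translate directly into a filter. In this minimal sketch, `max_size` plays the role of δ₁ and `min_matched_ratio` the role of δ₂, with defaults of 50 and 0.3 as used in the embodiment; the function name and sample blocks are hypothetical.

```python
from typing import Dict, Set


def prune_blocks(blocks: Dict[str, Set[str]], matched: Set[str],
                 max_size: int = 50,
                 min_matched_ratio: float = 0.3) -> Dict[str, Set[str]]:
    """Keep only blocks worth comparing in the current iteration."""
    kept = {}
    for key, entities in blocks.items():
        if not entities:
            continue
        n_matched = len(entities & matched)
        if len(entities) > max_size:
            continue  # too large: candidate generation would be costly
        if n_matched / len(entities) < min_matched_ratio:
            continue  # too few matched entities to align the block reliably
        if n_matched == len(entities):
            continue  # nothing left to match in this block
        kept[key] = entities
    return kept


sample_blocks = {
    "x": {"a", "b", "c"},       # 1 of 3 matched: kept
    "y": {"d", "e"},            # fully matched: dropped
    "z": {"f", "g", "h"},       # no matched entity: dropped
    "w": {"a", "i", "j", "k"},  # exceeds max_size=3 below: dropped
}
pruned = prune_blocks(sample_blocks, matched={"a", "d", "e"}, max_size=3)
print(pruned)
```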

In step (2-4), the similarity between blocks is obtained as follows:

Each block is regarded as a set of entities, and an already matched entity pair is regarded as an element common to the two sets; the similarity sim_block(b_k, b_l) between blocks b_k and b_l is then measured by set similarity.
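The text here does not reproduce the exact set-similarity formula, so the following sketch assumes a Jaccard-style measure: matched pairs that span the two blocks count as the intersection, and the union is the total number of entities minus that overlap. This choice, and the name `block_similarity`, are assumptions for illustration rather than the patent's formula.

```python
from typing import Set, Tuple


def block_similarity(block1: Set[str], block2: Set[str],
                     matches: Set[Tuple[str, str]]) -> float:
    """Jaccard-style similarity where matched pairs act as shared elements."""
    shared = sum(1 for e1, e2 in matches if e1 in block1 and e2 in block2)
    union = len(block1) + len(block2) - shared
    return shared / union if union > 0 else 0.0


known_matches = {("a1", "b1"), ("a2", "b2"), ("a9", "b9")}
score = block_similarity({"a1", "a2", "a3"}, {"b1", "b2"}, known_matches)
print(round(score, 3))
```

Two of the known matches link the sample blocks, and the blocks hold five entities in total, so the score is 2 / (3 + 2 - 2).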

In step (2-7), the similarity between entities is given by:

sim(e_i, e_j) = α·sim_string(e_i, e_j) + (1 - α)·sim_block(b_k, b_l)

s.t. e_i ∈ b_k, e_j ∈ b_l

where b_k and b_l denote the blocks containing entities e_i and e_j, respectively; sim_string(e_i, e_j) and sim_block(b_k, b_l) denote the string similarity between the entities and the similarity between their blocks, respectively; and α is the weight of the string similarity, with a value in [0, 1].
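As a small worked example of this weighted combination, the sketch below uses α = 0.6 and δ_e = 0.65, the values given in the embodiment; the function names and the sample scores are hypothetical.

```python
def entity_similarity(sim_string: float, sim_block: float,
                      alpha: float = 0.6) -> float:
    """Weighted sum of string similarity and block similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_string + (1.0 - alpha) * sim_block


def is_match(sim_string: float, sim_block: float,
             alpha: float = 0.6, threshold: float = 0.65) -> bool:
    """Confirm a candidate pair when the combined score exceeds the threshold."""
    return entity_similarity(sim_string, sim_block, alpha) > threshold


# 0.6 * 0.9 + 0.4 * 0.5 = 0.74, which exceeds 0.65.
combined = entity_similarity(0.9, 0.5)
print(round(combined, 2), is_match(0.9, 0.5))
```

With these values, a pair with near-identical names can still be rejected if its blocks share few matched entities, and a pair with moderately different names can be accepted when its blocks align strongly.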

Preferably, similarity functions based on Levenshtein distance, Jaro-Winkler distance, q-grams, and I-SUB are used, and the string similarity is computed by combining these functions through linear weighting.
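A linearly weighted string similarity might be sketched as follows. For brevity only two of the four named measures are implemented (Levenshtein and q-gram); Jaro-Winkler and I-SUB are omitted, and the equal weights, the choice q = 2, and all function names are assumptions made for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of the q-gram sets of the two strings."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a and not grams_b:
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def string_similarity(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Linear combination of the individual similarity functions."""
    w_lev, w_qgram = weights
    return w_lev * levenshtein_similarity(a, b) + w_qgram * qgram_similarity(a, b)


print(levenshtein("kitten", "sitting"))
print(round(string_similarity("interstellar", "interstelar"), 3))
```

Extending the sketch to all four measures would only add more terms to the weighted sum in `string_similarity`.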

The invention applies iterative matching to the alignment of heterogeneous knowledge bases: an iterative framework traverses the relations and uses them to partition the knowledge bases, which enlarges the search space for candidate entity pairs. At the same time, a divide-and-conquer strategy is used to select and confirm candidate entity pairs, so that each entity needs a comprehensive comparison with only a few candidate entities, which improves efficiency. Compared with existing methods, its advantages are:

(1) Knowledge base alignment is treated as an iterative process. Across iterations, the relations are traversed to partition the knowledge bases, and matching block pairs are used to select candidate entity pairs, so the method does not depend on alignable relations and attributes between the knowledge bases.

(2) In each iteration, only a small number of matching entity pairs are discovered, and these are then used to select candidate entity pairs. Because candidate selection draws on information from a growing set of matching entity pairs, the quality of the candidate entity pairs improves.

Description of drawings

Fig. 1 is a flow diagram of the method of the invention for aligning large heterogeneous knowledge bases based on iterative matching;

Fig. 2 is a flow chart of the data preprocessing stage of the method;

Fig. 3 is a flow chart of the knowledge base alignment stage of the method.

Detailed description

To describe the invention more specifically, its technical solution is explained in detail below with reference to the drawings and specific embodiments.

As shown in Fig. 1, the method of the invention for aligning large heterogeneous knowledge bases based on iterative matching is divided into two parts: data preprocessing and knowledge base alignment. Data preprocessing: screen the data in the original knowledge bases, unify the data format, and obtain the relations in the knowledge bases and the initial matching entity pairs. Knowledge base alignment: first partition the preprocessed knowledge bases using their relations and prune the blocks; then match blocks using the already matched entity pairs to obtain matching block pairs; next select candidate entity pairs from the matching block pairs and confirm them using a similarity measure and a threshold; finally repeat these steps until no new candidate entity pairs can be found, which yields all matching entity pairs.

Fig. 2 shows the flow chart of the data preprocessing stage. According to Fig. 2, this stage is divided into the following steps:

S1-1. Input any two original knowledge bases KB₁ and KB₂, and remove information irrelevant to the alignment task from them.

A knowledge base is defined as a six-tuple (E, L, R, P, F_R, F_P), where E, L, R, and P denote the sets of entities, literals, relations, and attributes, respectively; F_R is the set of entity-relation-entity triples, representing relation facts whose object is an entity; and F_P is the set of entity-attribute-literal triples, representing attribute facts whose object is a literal. Both F_R and F_P may contain meaningless information; for example, some knowledge bases include the original text corpus from which the triples were extracted, and such information reduces the efficiency of the algorithm. In addition, some triples containing the "sameAs" relation should also be removed.

S1-2. Unify the data format of the literals L₁ in knowledge base KB₁ and the literals L₂ in knowledge base KB₂, so that dates, numbers, and names are expressed in a uniform format.

Literals such as names, dates, and numbers may be written differently in different knowledge bases, for example "2016-01-01" and "01.01.2016". Unifying them facilitates later comparison. The method also converts literals to lowercase.

S1-3. Remove meaningless characters such as stop words, symbols, and language tags from the literals L₁ in knowledge base KB₁ and the literals L₂ in knowledge base KB₂ to obtain the processed knowledge bases KB′₁ and KB′₂.

The attribute descriptions of entities in a knowledge base may contain meaningless characters, for example stop words such as "the", "a", and "an"; symbols such as "#", "!", and "*"; and language tags such as "@en". Because these characters distort the similarity measure for entity pairs, they are removed.
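Steps S1-2 and S1-3 can be illustrated with a small literal normalizer. This is a sketch under assumptions: the dotted date form is read as DD.MM.YYYY and rewritten to YYYY-MM-DD, the stop-word list contains only the three words named above, and the regular expressions and the name `normalize_literal` are illustrative choices rather than the patent's rules.

```python
import re

STOP_WORDS = {"the", "a", "an"}  # only the examples named in the text
LANG_TAG = re.compile(r"@[a-zA-Z]{2,3}(-[a-zA-Z0-9]+)*$")  # e.g. "@en"
DOTTED_DATE = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{4})\b")


def normalize_literal(value: str) -> str:
    """Unify format, lowercase, and strip tags, symbols, and stop words."""
    text = LANG_TAG.sub("", value.strip())       # drop a trailing language tag
    text = DOTTED_DATE.sub(r"\3-\2-\1", text)    # assumed DD.MM.YYYY input
    text = text.lower()
    text = re.sub(r"[^\w\s-]", " ", text)        # drop symbols such as # ! *
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(normalize_literal('"The Matrix!"@en'))
print(normalize_literal("01.01.2016"))
```

The language tag is removed before symbol stripping so that the "@" still marks where the tag begins.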

S1-4. Obtain by statistics the relation set R₁ corresponding to knowledge base KB′₁ and the relation set R₂ corresponding to knowledge base KB′₂.

In this step, for knowledge base KB′₁, traverse all entity-relation-entity triples in its triple set F_R1 and collect the relation set R₁; for knowledge base KB′₂, traverse all entity-relation-entity triples in its triple set F_R2 and collect the relation set R₂. Relation sets R₁ and R₂ are used in the subsequent knowledge base partitioning.

S1-5. Compare all entities in knowledge base KB′₁ with all entities in knowledge base KB′₂ to obtain the initial matching entity pair set.

In this step, the initial matching entity pair set is obtained by the process described above for step (1-5).

Fig. 3 shows the flow chart of the knowledge base alignment stage. According to Fig. 3, this stage is divided into the following steps:

S2-1. Input knowledge base KB′₁, knowledge base KB′₂, relation set R₁, relation set R₂, and the initial matching entity pair set; set the block similarity threshold δ_b to 0.2, the entity similarity threshold δ_e to 0.65, the threshold δ₁ on the number of entities in a block to 50, and the threshold δ₂ on the ratio of matched entities in a block to 0.3; initialize the matching entity pair set M_e to the initial matching entity pair set.

S2-2. Randomly select any relation from relation set R₁ or relation set R₂, and use it to divide the entities in knowledge bases KB′₁ and KB′₂ into blocks, obtaining block set B₁ for knowledge base KB′₁ and block set B₂ for knowledge base KB′₂.

In this step, the entities in a knowledge base are divided into blocks using the relation as follows:

First, for the triple set F_R1 of knowledge base KB′₁, count the n distinct object entities in F_R1;

Then, for each object entity, group all subject entities that correspond to it in F_R1 into one block; the n object entities yield n blocks, which form block set B₁.

Block set B₂ is obtained in the same way, namely:

First, for the triple set F_R2 of knowledge base KB′₂, count the n distinct object entities in F_R2;

Then, for each object entity, group all subject entities that correspond to it in F_R2 into one block; the n object entities yield n blocks, which form block set B₂.

S2-3. Remove from block sets B₁ and B₂ the blocks that are computationally expensive or unlikely to yield matching entity pairs, obtaining the pruned block sets B′₁ and B′₂.

In this step, the blocks that are computationally expensive or unlikely to yield matching entity pairs include: blocks whose number of entities exceeds the threshold δ₁, blocks whose ratio of matched entities is below the threshold δ₂, and blocks whose entities have all been matched.

S2-4. Using all matching entity pairs in M_e, measure the similarity between every block in B′₁ and every block in B′₂, and match each two blocks whose similarity exceeds the block similarity threshold δ_b, obtaining the matching block pair set.

S2-5. For each matching block pair in the matching block pair set, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming the candidate entity pair set.

S2-6. Determine whether no new candidate entity pair has been found; if new candidates were found, go to S2-7; otherwise, end the iteration and output the matching entity pair set M_e.

S2-7. Compute the similarity between the two entities of each candidate entity pair, add the candidate entity pairs whose similarity exceeds the entity similarity threshold δ_e to M_e, and discard the rest.

In this step, the similarity between entities is measured in two ways, string similarity and block similarity, and the two are combined with a weight as follows:

sim(e_i, e_j) = α·sim_string(e_i, e_j) + (1 - α)·sim_block(b_k, b_l)

s.t. e_i ∈ b_k, e_j ∈ b_l

where sim_string(e_i, e_j) and sim_block(b_k, b_l) denote the string similarity between the entities and the similarity between their blocks, respectively; b_k and b_l denote the blocks containing entities e_i and e_j, respectively; and α is the weight of the string similarity, set to 0.6. For the attributes that entities e_i and e_j have in common (for example, name), the string similarity measures how similar the attribute values are. The method uses several similarity functions, based on Levenshtein distance, Jaro-Winkler distance, q-grams, and I-SUB, and combines them through linear weighting. The block similarity expresses the similarity of two entities through the similarity of the blocks that contain them. Once the similarity between two entities has been computed, it is compared with the threshold δ_e to decide whether the pair matches, and each newly discovered matching entity pair is added to the set of all matching entity pairs.

S2-8. Determine whether the number of iterations has reached the iteration threshold; if not, return to S2-2; if so, end the iteration and output the matching entity pair set M_e.

The specific embodiments described above explain the technical solution and beneficial effects of the invention in detail. It should be understood that they are only the most preferred embodiments of the invention and are not intended to limit it; any modification, supplement, or equivalent replacement made within the principles of the invention shall fall within its scope of protection.

Claims

1. A large-scale heterogeneous knowledge base alignment method based on iterative matching, including:

Data preprocessing stage: screen the data in any two original knowledge bases KB₁ and KB₂, unify the data format, and remove meaningless characters; obtain by statistics the relation set R₁ corresponding to the processed knowledge base KB′₁ and the relation set R₂ corresponding to the processed knowledge base KB′₂; and obtain the initial matching entity pair set by comparison;

Knowledge base alignment stage: partition knowledge bases KB′₁ and KB′₂ using the relations in relation sets R₁ and R₂, and prune the blocks to obtain the pruned block sets B′₁ and B′₂; then match the blocks in B′₁ and B′₂ using the initial matching entity pair set to obtain matching block pairs; finally, select candidate entity pairs from the matching block pairs, and confirm them using a similarity measure and a threshold.

2. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 1, characterized in that the specific steps of the data preprocessing stage are:

(1-1) Input any two original knowledge bases KB₁ and KB₂, and remove the information in KB₁ and KB₂ that is irrelevant to the alignment task;

(1-2) Unify the data format of the literals L₁ in knowledge base KB₁ and the literals L₂ in knowledge base KB₂, so that dates, numbers, and names are represented in a unified format;

(1-3) Remove the stop-word characters, symbol characters, and language-tag characters from the literals L₁ in knowledge base KB₁ and the literals L₂ in knowledge base KB₂, obtaining the processed knowledge bases KB′₁ and KB′₂;

(1-4) Obtain by statistics the relation set R₁ corresponding to knowledge base KB′₁ and the relation set R₂ corresponding to knowledge base KB′₂;

(1-5) Compare all entities in knowledge base KB′₁ and knowledge base KB′₂ to obtain the initial matching entity pair set.
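Step (1-3) can be pictured with a small literal-normalization routine. The sketch below is only one possible reading: the stop-word list, the RDF-style `@en` language-tag syntax, and the function name `normalize_literal` are all assumptions, and the format unification of dates and numbers from step (1-2) is left out.

```python
import re

# Hypothetical stop-word list; the claim does not enumerate one.
STOP_WORDS = {"the", "a", "an", "of"}


def normalize_literal(literal: str) -> str:
    """Strip a language tag, symbol characters, and stop words from a literal."""
    # Drop a trailing language tag such as "@en" or "@zh-CN" (assumed syntax).
    text = re.sub(r"@[A-Za-z]{2,3}(-[A-Za-z0-9]+)*$", "", literal.strip())
    # Replace symbol characters (anything that is not a word character
    # or whitespace) with spaces; this also removes surrounding quotes.
    text = re.sub(r"[^\w\s]", " ", text)
    # Lower-case, tokenize on whitespace, and drop stop words.
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(tokens)
```

For example, `normalize_literal('"The Lord of the Rings"@en')` yields `lord rings`, so that two knowledge bases that differ only in quoting, tagging, or filler words produce the same string.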

3. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 2, characterized in that the specific process of step (1-4) is:

For knowledge base KB′₁, traverse all entity-relation-entity triples in the triple set F_R1 belonging to that knowledge base and obtain the relation set R₁ by statistics; for knowledge base KB′₂, traverse all entity-relation-entity triples in the triple set F_R2 belonging to that knowledge base and obtain the relation set R₂ by statistics.

4. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 2, characterized in that, in step (1-5), the initial matching entity pair set is obtained as follows:

First, extract all entities in knowledge base KB′₁ to form the entity set E₁ and all entities in knowledge base KB′₂ to form the entity set E₂, and take the Cartesian product of E₁ and E₂, pairing any entity in E₁ with any entity in E₂, to form an entity pair set;

Then, select from the entity pair set the entity pairs in which the name attributes of the two entities are identical strings, obtaining the pre-initial matching entity pair set;

Finally, select from the pre-initial matching entity pair set the entity pairs that stand in a one-to-one matching relationship, which form the initial matching entity pair set.
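The three steps of this claim can be sketched in Python as follows. The sketch assumes each entity carries a single, already normalized name string, and it indexes entities by name instead of materializing the full Cartesian product; under that assumption the result is the same, because a name-identical pair is one-to-one exactly when its name occurs once in each knowledge base. The function and parameter names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def initial_matches(names1: Dict[str, str],
                    names2: Dict[str, str]) -> Set[Tuple[str, str]]:
    """names1 / names2 map entity id -> name attribute for each knowledge base."""
    by_name1: Dict[str, List[str]] = defaultdict(list)
    by_name2: Dict[str, List[str]] = defaultdict(list)
    for entity, name in names1.items():
        by_name1[name].append(entity)
    for entity, name in names2.items():
        by_name2[name].append(entity)

    matches: Set[Tuple[str, str]] = set()
    for name, ents1 in by_name1.items():
        ents2 = by_name2.get(name, [])
        # Identical names give the pre-initial pairs; keep a pair only when
        # it is one-to-one, i.e. the name is unique on both sides.
        if len(ents1) == 1 and len(ents2) == 1:
            matches.add((ents1[0], ents2[0]))
    return matches
```

In a toy input where "paris" appears once in each knowledge base but "springfield" appears once in the first and twice in the second, only the "paris" pair is kept, since the "springfield" pairs are ambiguous.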

5. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 1, wherein the specific steps of the knowledge base alignment stage are:

(2-1) Input knowledge base KB′₁, knowledge base KB′₂, relation set R₁, relation set R₂, and the initial matching entity pair set; set the block similarity threshold δ_b, the entity similarity threshold δ_e, the threshold δ₁ on the number of entities in a block, and the threshold δ₂ on the ratio of matched entities in a block; and initialize the matching entity pair set M_e as the initial matching entity pair set;

(2-2) Randomly select a relation from relation set R₁ or relation set R₂, and use this relation to divide the entities in knowledge base KB′₁ and knowledge base KB′₂ into several blocks, obtaining the block set B₁ corresponding to KB′₁ and the block set B₂ corresponding to KB′₂;

(2-3) Remove from block sets B₁ and B₂ the blocks that are prone to cause high computational load or are unlikely to yield matching entity pairs, obtaining the reduced block sets B′₁ and B′₂;

(2-4) Use all matching entity pairs in the matching entity pair set M_e to measure the similarity between any block in the reduced block set B′₁ and any block in the reduced block set B′₂, and match any two blocks whose similarity is greater than the block similarity threshold δ_b, obtaining a matching block pair set;

(2-5) For each matching block pair in the matching block pair set, take the Cartesian product of the unmatched entities in one block and the unmatched entities in the other block as candidate entity pairs, forming a candidate entity pair set;

(2-6) Determine whether no new candidate entity pair has been found; if not, go to step (2-7); if so, end the iteration and output the matching entity pair set M_e;

(2-7) Calculate the similarity between the two entities of each candidate entity pair in the candidate entity pair set, add the candidate entity pairs whose similarity is greater than the entity similarity threshold δ_e to the matching entity pair set M_e, and discard the remaining candidate entity pairs;

(2-8) Determine whether the number of iterations is less than the iteration threshold; if not, jump to step (2-2); if so, end the iteration and output the matching entity pair set M_e.
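The loop formed by steps (2-1) to (2-8) can be sketched compactly in Python. This is a simplified illustration under several assumptions, not the claimed procedure itself: block reduction (2-3) is omitted, the block similarity is taken to be a Jaccard-style ratio, "no new candidate pair" is read as "this round produced no candidate pairs", the iteration threshold is treated as an upper bound on the number of rounds, and all names and default threshold values are hypothetical. The string similarity is passed in as a callable.

```python
import random
from collections import defaultdict
from itertools import product


def partition(triples, relation):
    """Group subject entities by object entity for one relation."""
    blocks = defaultdict(set)
    for subj, rel, obj in triples:
        if rel == relation:
            blocks[obj].add(subj)
    return list(blocks.values())


def block_sim(b1, b2, matches):
    """Matched pairs count as shared elements of the two blocks (assumed Jaccard)."""
    shared = sum(1 for e1, e2 in matches if e1 in b1 and e2 in b2)
    union = len(b1) + len(b2) - shared
    return shared / union if union else 0.0


def align(triples1, triples2, relations, seed_matches, string_sim,
          delta_b=0.3, delta_e=0.7, alpha=0.6, max_iters=10, rng=None):
    rng = rng or random.Random(0)
    matches = set(seed_matches)                      # (2-1) initialize M_e
    for _ in range(max_iters):                       # (2-8) bounded iteration
        relation = rng.choice(sorted(relations))     # (2-2) pick a relation
        blocks1 = partition(triples1, relation)
        blocks2 = partition(triples2, relation)
        matched1 = {e1 for e1, _ in matches}
        matched2 = {e2 for _, e2 in matches}
        candidates = {}
        for b1, b2 in product(blocks1, blocks2):     # (2-4) match blocks
            sb = block_sim(b1, b2, matches)
            if sb > delta_b:
                # (2-5) unmatched entities of a matched block pair
                for pair in product(b1 - matched1, b2 - matched2):
                    candidates[pair] = max(sb, candidates.get(pair, 0.0))
        if not candidates:                           # (2-6) nothing left to test
            break
        for (e1, e2), sb in candidates.items():      # (2-7) confirm candidates
            if alpha * string_sim(e1, e2) + (1 - alpha) * sb > delta_e:
                matches.add((e1, e2))
    return matches
```

The point of the iteration is that each confirmed pair raises the similarity of the blocks containing it, so block pairs that fell below δ_b in one round may be matched in a later round.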

6. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-2), the specific process of using the relation to divide the entities in a knowledge base into several blocks is:

First, for the triple set F_R1 in knowledge base KB′₁, obtain by statistics the n kinds of object entities in F_R1;

Then, for each object entity, gather all the subject entities corresponding to it in the triple set F_R1 into one block, so that the n kinds of object entities yield n blocks, which form the block set B₁;

Use the same method to obtain the block set B₂.

7. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-3), the blocks that are prone to cause high computational load or are unlikely to yield matching entity pairs include: blocks whose number of entities exceeds the threshold δ₁, blocks whose ratio of matched entities is below the threshold δ₂, and blocks whose entities have already been matched.
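The three removal criteria can be expressed as a simple filter. In the Python sketch below, the third criterion is interpreted as "every entity in the block is already matched", which is an assumption about the claim's wording; the function name and the example threshold values are likewise hypothetical.

```python
def reduce_blocks(blocks, matched, delta_1, delta_2):
    """Return the blocks that survive the three removal criteria.

    blocks  -- list of sets of entity ids
    matched -- set of entity ids that already belong to a matching pair
    """
    kept = []
    for block in blocks:
        if not block:
            continue
        n_matched = len(block & matched)
        if len(block) > delta_1:                # too large: high computational load
            continue
        if n_matched / len(block) < delta_2:    # too few matched entities
            continue
        if n_matched == len(block):             # nothing left to match (assumed reading)
            continue
        kept.append(block)
    return kept
```

With δ₁ = 5 and δ₂ = 0.5, a six-entity block is dropped for its size, a block with no matched entities is dropped for its ratio, a fully matched block is dropped as exhausted, and a two-entity block with one matched entity is kept.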

8. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-4), the similarity between blocks is obtained as follows:

Each block is regarded as a set of entities, and matched entity pairs are regarded as elements common to the two sets; the similarity between blocks is then measured by set similarity, and the similarity sim_block(b_k, b_l) is calculated by the formula:

where b_k and b_l denote two blocks, |b_k ∩ b_l| denotes the number of matching entity pairs in the two blocks, and |b_k ∪ b_l| denotes the total number of entities in the two blocks.
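Read together with the definitions of |b_k ∩ b_l| and |b_k ∪ b_l|, the set similarity described here is consistent with the Jaccard coefficient. Under that assumption (the expression below is a reconstruction, not text taken from the claim), it can be written as:

```latex
\[
\mathrm{sim}_{block}(b_k, b_l) = \frac{|b_k \cap b_l|}{|b_k \cup b_l|}
\]
```

For instance, if b_k holds 4 entities, b_l holds 5, and 2 matched entity pairs link them, then counting each matched pair once in the union gives |b_k ∪ b_l| = 4 + 5 − 2 = 7 and a block similarity of 2/7 ≈ 0.29. Counting the union as 4 + 5 = 9 would be the other possible reading of "total number of entities", giving 2/9 ≈ 0.22.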

9. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 8, characterized in that, in step (2-7), the similarity between entities is obtained by the formula:

sim(e_i, e_j) = α·sim_string(e_i, e_j) + (1 − α)·sim_block(b_k, b_l)

s.t. e_i ∈ b_k, e_j ∈ b_l

where b_k and b_l denote the blocks containing entities e_i and e_j, respectively; sim_string(e_i, e_j) and sim_block(b_k, b_l) denote the string similarity and the block similarity, respectively; and α is the weight of the string similarity, with a value in the range [0, 1].

10. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 9, characterized in that similarity functions based on Levenshtein distance, Jaro-Winkler distance, q-grams, and I-SUB are adopted, and the string similarity is obtained by combining these similarity functions by linear weighting.
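A linear weighting of string similarity functions might look like the following Python sketch. It implements only two of the four named measures (Levenshtein and q-gram); Jaro-Winkler and I-SUB are left out for brevity. The normalization of the Levenshtein distance by the longer string length, the use of a Jaccard ratio over q-gram sets, and the equal weights are all assumptions, since the claim does not fix them.

```python
def levenshtein_sim(a: str, b: str) -> float:
    """1 - edit_distance / max_length, so identical strings score 1.0."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))


def qgram_sim(a: str, b: str, q: int = 2) -> float:
    """Jaccard ratio over the sets of q-grams of the two strings."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a and not grams_b:
        # Both strings are shorter than q: fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def string_similarity(a: str, b: str, weights=(0.5, 0.5)) -> float:
    """Linear weighting of the individual measures (weights are hypothetical)."""
    w_lev, w_q = weights
    return w_lev * levenshtein_sim(a, b) + w_q * qgram_sim(a, b)
```

Combining an edit-distance measure with a q-gram measure is a common way to tolerate both small typos and token reordering; adding the remaining two measures would only extend the weight tuple and the sum.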