[go: up one dir, main page]

CN107145523B - Alignment method for large heterogeneous knowledge bases based on iterative matching - Google Patents

Alignment method for large heterogeneous knowledge bases based on iterative matching Download PDF

Info

Publication number
CN107145523B
CN107145523B CN201710237034.6A CN201710237034A CN107145523B CN 107145523 B CN107145523 B CN 107145523B CN 201710237034 A CN201710237034 A CN 201710237034A CN 107145523 B CN107145523 B CN 107145523B
Authority
CN
China
Prior art keywords
entity
knowledge base
matching
block
entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710237034.6A
Other languages
Chinese (zh)
Other versions
CN107145523A (en
Inventor
陈岭
顾伟东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201710237034.6A priority Critical patent/CN107145523B/en
Publication of CN107145523A publication Critical patent/CN107145523A/en
Application granted granted Critical
Publication of CN107145523B publication Critical patent/CN107145523B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/288Entity relationship models

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种基于迭代匹配的大型异构知识库对齐方法,具体实施如下:1)对原知识库中的数据进行筛选、统一数据格式,在此基础上获取知识库中的关系和初始匹配实体对;2)利用知识库中的关系对预处理后的知识库分区,并精简区块;3)利用已匹配实体对匹配区块,得到匹配区块对;4)在匹配区块对中挑选候选实体对,并结合相似度度量方法和阈值确认候选实体对;5)重复上述步骤,直到不能发现新的候选实体对,得到所有匹配实体对。本发明结合迭代匹配的思想对齐异构知识库,在知识库对齐、数据融合、自动问答等领域具有广阔的应用前景。

The invention discloses a large-scale heterogeneous knowledge base alignment method based on iterative matching. The specific implementation is as follows: 1) Screen the data in the original knowledge base, unify the data format, and obtain the relationship and initial knowledge base in the knowledge base on this basis. Match entity pairs; 2) use the relationship in the knowledge base to partition the preprocessed knowledge base, and simplify the blocks; 3) use the matched entity pairs to match blocks to obtain matching block pairs; 4) match block pairs Select candidate entity pairs, and combine the similarity measurement method and threshold to confirm candidate entity pairs; 5) Repeat the above steps until no new candidate entity pairs can be found, and all matching entity pairs are obtained. The invention combines the idea of iterative matching to align heterogeneous knowledge bases, and has broad application prospects in the fields of knowledge base alignment, data fusion, automatic question answering and the like.

Description

基于迭代匹配的大型异构知识库对齐方法Alignment method for large heterogeneous knowledge bases based on iterative matching

技术领域technical field

本发明涉及知识库对齐领域,尤其涉及一种基于迭代匹配的大型异构知识库对齐方法。The invention relates to the field of knowledge base alignment, in particular to a large-scale heterogeneous knowledge base alignment method based on iterative matching.

背景技术Background technique

随着Web 3.0的到来,结构化的知识库越来越频繁地出现在互联网上。这些知识库被广泛应用于各类语义应用中,例如:自动问答、搜索服务和社交服务等。然而,单个知识库的信息有限,限制了这些应用的功能。在此背景下,知识库对齐有了巨大的发展空间。知识库对齐(Knowledge Base Alignment)通常指知识库的实体对齐,即自动发现代表现实中同一事物的两个实体并连接它们。With the advent of Web 3.0, structured knowledge bases are appearing more and more frequently on the Internet. These knowledge bases are widely used in various semantic applications, such as automatic question answering, search services, and social services. However, the limited information of a single knowledge base limits the functionality of these applications. In this context, knowledge base alignment has huge room for development. Knowledge Base Alignment (Knowledge Base Alignment) usually refers to the entity alignment of the knowledge base, that is, to automatically discover two entities representing the same thing in reality and connect them.

由于知识库规模的不断增长,知识库对齐方法通常将对齐过程分为两个步骤:发现候选实体对和确认候选实体对。发现候选实体对通常利用少量属性快速为每个实体筛选出几个候选实体,确认候选实体对通过全面比较两个实体,利用相似度和阈值判断两实体是否匹配。由于避免了实体两两之间的精确比较,这种做法大大提高了方法的整体效率。目前,知识库对齐方法的瓶颈在于发现的候选实体对常常有所遗漏,进一步导致可匹配的实体对未被发现。Due to the ever-increasing size of knowledge bases, knowledge base alignment methods usually divide the alignment process into two steps: discovering candidate entity pairs and confirming candidate entity pairs. Discovering candidate entity pairs usually uses a small number of attributes to quickly screen out several candidate entities for each entity, and confirming candidate entity pairs comprehensively compares two entities, and uses similarity and threshold to determine whether the two entities match. This approach greatly improves the overall efficiency of the method by avoiding exact pairwise comparisons of entities. At present, the bottleneck of the knowledge base alignment method is that the discovered candidate entity pairs are often missed, which further leads to the undiscovered matching entity pairs.

为提高候选实体对的质量,研究人员提出使用迭代匹配的思想,即每轮发现少量的匹配实体对,并作为下一轮发现候选实体对的依据。然而,传统的知识库对齐方法通常关注同构知识库对齐,即两知识库间有较多可对齐关系。其基本假设为:如果一对实体对匹配,并且它们有对齐的关系,那么它们的“兼容邻居”有较大概率匹配,因此将“兼容邻居”作为候选实体对。但是,由于知识库间可对齐关系少,传统方法将遗漏部分候选实体对。为了解决该问题,研究人员提出使用基于类的知识库对齐方法。该方法将具有相同特征的实例划分到同一个类中,并排除与类的内容不相关的候选实体,以此来确认候选实体对。然而,由于该方法仅在模型初始阶段通过经典的分区技术获取候选实体对,因此当两个知识库间对齐的属性较少时,该方法也将遗漏较多的候选实体对。In order to improve the quality of candidate entity pairs, researchers propose the idea of using iterative matching, that is, a small number of matching entity pairs are found in each round, and used as the basis for the next round of candidate entity pairs. However, traditional knowledge base alignment methods usually focus on the alignment of isomorphic knowledge bases, that is, there are more alignable relations between two knowledge bases. The basic assumption is: if a pair of entity pairs match and they have an aligned relationship, then their "compatible neighbors" have a higher probability of matching, so "compatible neighbors" are used as candidate entity pairs. However, due to the lack of alignable relationships between knowledge bases, traditional methods will miss some candidate entity pairs. To solve this problem, researchers propose to use a class-based knowledge base alignment method. This method classifies instances with the same characteristics into the same class and excludes candidate entities that are not related to the content of the class to confirm candidate entity pairs. However, since this method only obtains candidate entity pairs through classical partitioning techniques at the initial stage of the model, when there are fewer attributes aligned between the two knowledge bases, this method will also miss more candidate entity pairs.

发明内容Contents of the invention

鉴于上述,本发明提出了一种基于迭代匹配的大型异构知识库对齐方法。该方法结合迭代匹配思想来进行知识库对齐,使用迭代框架遍历关系对知识库进行分区,扩大了候选实体对的搜索空间;同时,采用使用分治思想挑选和确认候选实体对,使得每个实体仅需和几个候选实体进行全面比较,提高了方法的效率。In view of the above, the present invention proposes a large-scale heterogeneous knowledge base alignment method based on iterative matching. This method combines the idea of iterative matching to align the knowledge base, uses the iterative framework to traverse the relationship to partition the knowledge base, and expands the search space for candidate entity pairs; at the same time, uses the idea of divide and conquer to select and confirm candidate entity pairs, so that each entity Only a few candidate entities need to be comprehensively compared, increasing the efficiency of the method.

一种基于迭代匹配的大型异构知识库对齐方法,具体包括:A method for aligning large heterogeneous knowledge bases based on iterative matching, including:

数据预处理阶段:对任意两个原知识库KB1、KB2中的数据进行筛选、统一数据格式以及剔除无意义字符处理,并统计获取与处理后知识库KB′1相对应的关系集R1、与处理后知识库KB′2相对应的关系集R2,比较获取初始匹配实体对集 Data preprocessing stage: filter the data in any two original knowledge bases KB 1 and KB 2 , unify the data format and eliminate meaningless characters, and obtain the relation set R corresponding to the processed knowledge base KB′ 1 1. The relationship set R 2 corresponding to the processed knowledge base KB′ 2 is compared to obtain the initial matching entity pair set

知识库对齐阶段:利用关系集R1与关系集R2中的关系对知识库KB′1和知识库KB′2进行分区,并对每个区块进行精简,得到精简区块集B′1和B′2;然后,利用初始匹配实体对集匹配精简区块集B′1和B′2中的区块,得到匹配区块对,最后,在匹配区块对中挑选候选实体对,并结合相似度度量方法和阈值δe确认候选实体对。Knowledge base alignment stage: use the relationship in relation set R 1 and relation set R 2 to partition knowledge base KB′ 1 and knowledge base KB′ 2 , and simplify each block to obtain a simplified block set B′ 1 and B′ 2 ; then, use the initial matching entity pair set Match the blocks in the reduced block sets B′ 1 and B′ 2 to obtain matching block pairs. Finally, select candidate entity pairs from the matching block pairs, and combine the similarity measurement method and the threshold δ e to confirm the candidate entity pairs .

所述的数据预处理阶段的具体步骤为:The concrete steps of the described data preprocessing stage are:

(1-1)输入任意两个原知识库KB1、KB2,并去除知识库KB1、KB2中与对齐任务无关的信息;(1-1) Input any two original knowledge bases KB 1 and KB 2 , and remove the information irrelevant to the alignment task in the knowledge bases KB 1 and KB 2 ;

(1-2)对知识库KB1中的字面量L1和知识库KB2中的字面量L2统一数据格式,将日期、数字、姓名表示为统一格式;(1-2) unify the data format to the literal quantity L 1 in the knowledge base KB 1 and the literal quantity L 2 in the knowledge base KB 2 , date, numeral, full name are represented as unified format;

(1-3)去除知识库KB1中的字面量L1和知识库KB2中的字面量L2中停用词字符、符号字符、语言标签字符,得到处理后知识库KB′1和KB′2(1-3) Remove the stop word characters, symbol characters, and language label characters in the literal quantity L 1 in the knowledge base KB 1 and the literal quantity L 2 in the knowledge base KB 2 , and obtain the processed knowledge base KB′ 1 and KB '2;

(1-4)统计获取与知识库KB′1相对的关系集R1、与知识库KB′2相对应的关系集R2(1-4) statistically obtain the relation set R 1 corresponding to the knowledge base KB′ 1 and the relation set R 2 corresponding to the knowledge base KB′ 2 ;

(1-5)比较知识库KB′1与知识库KB′2中的所有实体,获取初始匹配实体对集 (1-5) Compare all entities in knowledge base KB′ 1 and knowledge base KB′ 2 , and obtain the initial matching entity pair set

知识库定义为六元组(E,L,R,P,FR,FP),其中,E,L,R,P分别表示实体、字面量、关系以及属性的集合;代表实体-关系-实体的三元组集合,表示宾语为实体的关系事实;代表实体-属性-字面量的三元组集合,表示宾语为字面量的属性事实;FR和FP中都存在无意义的信息,例如:某些知识库中包含用于抽取三元组的原文本语料,这些信息会影响算法的效率。另外,某些包含“sameAs”关系的三元组也应被去除。The knowledge base is defined as a six-tuple (E, L, R, P, F R , F P ), where E, L, R, and P respectively represent a collection of entities, literals, relationships, and attributes; Represents the triplet set of entity-relation-entity, and represents the relationship fact that the object is an entity; Represents the set of triples of entity-attribute-literal, and represents the fact that the object is a literal attribute; there are meaningless information in both FR and FP , for example: some knowledge bases contain triples for extracting The original text corpus, this information will affect the efficiency of the algorithm. In addition, some triples containing the "sameAs" relationship should also be removed.

所述步骤(1-4)的具体过程为:The concrete process of described step (1-4) is:

对于知识库KB′1,遍历属于该知识库的三元组集合FR1中的所有三元组(实体-关系-实体),统计得到关系集R1;对于知识库KB′2,遍历属于该知识库的三元组集合FR2中的所有三元组(实体-关系-实体),统计得到关系集R2,关系集R1和关系集R2用于后续的知识库分区操作。For the knowledge base KB′ 1 , traverse all the triples (entity-relationship-entity) in the triple set F R1 belonging to the knowledge base, and obtain the relation set R 1 through statistics; for the knowledge base KB′ 2 , traverse all the triples belonging to the All the triples (entity-relationship-entity) in the triplet set F R2 of the knowledge base are counted to obtain the relational set R 2 , the relational set R 1 and the relational set R 2 for subsequent knowledge base partitioning operations.

步骤(1-5)中,所述的初始匹配实体对集的获取过程为:In step (1-5), the initial matching entity pair set The acquisition process is:

首先,提取知识库KB′1中的所有实体组成实体集E1,提取知识库KB′2中的所有实体组成实体集E2;并以实体集E1中的任一实体与实体集E2中的任一实体的笛卡尔积作为实体对,组成实体对集;First, extract all the entities in the knowledge base KB′ 1 to form the entity set E 1 , extract all the entities in the knowledge base KB’ 2 to form the entity set E 2 ; and use any entity in the entity set E 1 to form the entity set E 2 The Cartesian product of any entity in is used as an entity pair to form an entity pair set;

然后,筛选获取实体对集中两实体姓名属性的字符串表示完全相同的实体对,得到预初始匹配实体对集;Then, filter and obtain the entity pairs whose strings of the two entity name attributes in the entity pair set represent identical entity pairs, and obtain the pre-initial matching entity pair set;

最后,筛选预初始匹配实体对集中具有一对一匹配关系的实体对,作为初始匹配实体对集 Finally, filter the entity pairs with one-to-one matching relationship in the pre-initial matching entity pair set as the initial matching entity pair set

所述的知识库对齐阶段的具体步骤为:The specific steps of the knowledge base alignment phase are as follows:

(2-1)输入知识库KB′1、知识库KB′2、关系集R1、关系集R2、初始匹配实体对集设置区块相似度阈值δb、实体相似度阈值δe、区块内实体数量阈值δ1以及区块内已匹配实体比率阈值δ2,匹配实体对集Me初始化为初始匹配实体对集 (2-1) Input knowledge base KB′ 1 , knowledge base KB′ 2 , relation set R 1 , relation set R 2 , initial matching entity pair set Set the block similarity threshold δ b , the entity similarity threshold δ e , the threshold of the number of entities in the block δ 1 , and the threshold of the ratio of matched entities in the block δ 2 , the matching entity pair set M e is initialized as the initial matching entity pair set

(2-2)随机选取关系集R1或关系集R2中的任一关系,利用该关系将知识库KB′1和知识库KB′2中的实体分成若干个区块,得到与知识库KB′1相对应的区块集B1、与知识库KB′2相对应的区块集B2(2-2) Randomly select any relation in relation set R 1 or relation set R 2 , use this relation to divide entities in knowledge base KB′ 1 and knowledge base KB′ 2 into several blocks, and obtain Block set B 1 corresponding to KB′ 1 , block set B 2 corresponding to knowledge base KB′ 2 ;

(2-3)去除区块集B1和区块集B2中易产生高计算量或难以生成匹配实体对的区块,得到精简区块集B′1和精简区块集B′2(2-3) Remove block sets B 1 and block sets B 2 that are prone to generate high computational load or blocks that are difficult to generate matching entity pairs, and obtain a simplified block set B' 1 and a simplified block set B'2;

(2-4)利用匹配实体对集Me中的所有匹配实体对度量精简区块集B′1中任一区块与精简区块集B′2中任一区块之间的相似度,选择相似度值大于区块相似度阈值δb的两个区块进行匹配,得到匹配区块对集;( 2-4 ) Use all matching entity pairs in the matching entity pair set Me to measure the similarity between any block in the reduced block set B'1 and any block in the reduced block set B'2 , Select two blocks whose similarity value is greater than the block similarity threshold δ b to match, and obtain a matching block pair set;

(2-5)对属于匹配区块对集中的任一匹配区块对,以该匹配区块对的一个区块中的任一未匹配实体与该匹配区块对的另一个区块中的任一未匹配实体的笛卡尔积作为候选实体对,组成候选实体对集;(2-5) For any matching block pair belonging to the matching block pair set, any unmatched entity in one block of the matching block pair and any unmatched entity in the other block of the matching block pair The Cartesian product of any unmatched entity is used as a candidate entity pair to form a candidate entity pair set;

(2-6)判断是否未发现新候选实体对,若否,跳转执行步骤(2-7),若是,结束迭代,输出匹配实体对集Me(2-6) Judging whether no new candidate entity pair has been found, if not, jump to step (2-7), if so, end the iteration, and output the matching entity pair set M e ;

(2-7)计算候选实体对集中每个候选实体对中两实体之间的相似度,将相似度值大于实体相似度阈值δe对应的候选实体对添加至匹配实体对集Me中,剩下的候选实体对舍弃;(2-7) Calculate the similarity between two entities in each candidate entity pair in the candidate entity pair set, and add the candidate entity pair whose similarity value is greater than the entity similarity threshold δ e to the matching entity pair set M e , The remaining candidate entity pairs are discarded;

(2-8)判断迭代次数是否小于迭代阈值,都否,跳转执行步骤(2-2);若是,结束迭代,输出匹配实体对集Me(2-8) Determine whether the number of iterations is less than the iteration threshold, if not, skip to step (2-2); if yes, end the iteration, and output the matching entity pair set M e .

步骤(2-2)中,所述的利用关系将知识库中的实体分成若干个区块的具体过程为;In step (2-2), the specific process of dividing the entity in the knowledge base into several blocks by using the relationship is;

首先,对于知识库KB′1中的三元组集合FR1,统计得到三元组集合FR1中n种宾语实体;First, for the triplet set F R1 in the knowledge base KB′ 1 , get n types of object entities in the triplet set F R1 through statistics;

然后,对于每种宾语实体,将三元组集合FR1中与其相对应的所有主语实体放在一起,得到1个区块,n种宾语实体得到n个区块,组成区块集B1Then, for each object entity, put together all the subject entities corresponding to it in the triple set FR1 to obtain 1 block, and n types of object entities to obtain n blocks to form a block set B 1 ;

利用同样的方法得到区块集B2Use the same method to get the block set B 2 .

步骤(2-3)中,所述的易产生高计算量或难以生成匹配实体对的区块包括:实体数量超过阈值δ1的区块、已匹配实体比率小于阈值δ2的区块以及实体都已经匹配的区块。In step (2-3), the blocks that are prone to high computational load or difficult to generate matching entity pairs include: blocks with the number of entities exceeding the threshold δ1, blocks with the ratio of matched entities less than the threshold δ2 , and entities blocks that are already matched.

步骤(2-4)中,区块间的相似度的获取方法为:In step (2-4), the method for obtaining the similarity between blocks is:

将每个区块看是实体的集合,已经匹配的实体对看作是两集合间的相同元素,利用集合相似度来度量区块间的相似度,相似度simblock(bk,bl)的计算公式为:Each block is regarded as a collection of entities, and the matched entity pairs are regarded as the same elements between the two collections, and the similarity between blocks is measured by using the similarity of the collection, similarity sim block (b k , b l ) The calculation formula is:

其中,bk和bl表示两区块,|bk∩bl|表示两个区块内匹配实体对数量,|bk∪bl|表示两个区块内总实体数量。Among them, b k and b l represent two blocks, |b k ∩ b l | represents the number of matching entity pairs in the two blocks, and |b k ∪ b l | represents the total number of entities in the two blocks.

步骤(2-7)中,实体之间的相似度的获取公式为:In step (2-7), the formula for obtaining the similarity between entities is:

sim(ei,ej)=αsimstring(ei,ej)+(1-α)simblock(bk,bl)sim(e i ,e j )=αsim string (e i ,e j )+(1-α)sim block (b k ,b l )

s.t.ei∈bk,ej∈bl ste i ∈ b k , e j ∈ b l

其中,bk和bl分别表示实体ei和ej所在的区块,simstring(ei,ej)和simblock(bk,bl)分别表示实体间的字符串相似度和区块相似度,α是字符串相似度的权重,取值范围为[0,1]。Among them, b k and b l represent the blocks where entities e i and e j are located respectively, and sim string (e i , e j ) and sim block (b k , b l ) represent the string similarity and block Block similarity, α is the weight of string similarity, and the value range is [0,1].

作为优选,采用基于Levenshtein距离、基于Jaro-Winker距离、基于q-gram及基于I-SUB的相似度函数,并通过线性加权的方式组合这些相似度度量函数计算获得字符串相似度。Preferably, similarity functions based on Levenshtein distance, Jaro-Winker distance, q-gram and I-SUB are used, and these similarity measurement functions are combined in a linear weighted manner to obtain string similarity.

本发明结合迭代匹配思想进行异构知识库对齐,使用迭代框架遍历关系对知识库进行分区,扩大了候选实体对的搜索空间;同时,采用使用分治思想挑选和确认候选实体对,使得每个实体仅需和几个候选实体进行全面比较,提高了方法的效率。与现有的方法相比,其优点在于:The invention combines the idea of iterative matching to align heterogeneous knowledge bases, uses the iterative framework to traverse the relationship to partition the knowledge base, and expands the search space for candidate entity pairs; at the same time, adopts the idea of divide and conquer to select and confirm candidate entity pairs, so that each Entities only need to be comprehensively compared with a few candidate entities, which improves the efficiency of the method. Compared with existing methods, its advantages are:

(1)将知识库对齐看成一个迭代过程。在不同迭代中,遍历各关系对知识库进行分区,并利用匹配的区块对挑选候选实体对,使得对齐方法不依赖于知识库间可对齐的关系和属性。(1) View knowledge base alignment as an iterative process. In different iterations, each relation is traversed to partition the knowledge base, and the matching block pairs are used to select candidate entity pairs, so that the alignment method does not depend on the alignable relations and attributes between the knowledge bases.

(2)在每轮迭代中,仅发现少量匹配实体对,并将这些匹配实体对用于候选实体对的挑选,由于挑选候选实体对的过程使用了更多匹配实体对的信息,因此提高了候选实体对的质量。(2) In each round of iteration, only a small number of matching entity pairs are found, and these matching entity pairs are used for the selection of candidate entity pairs. Since the process of selecting candidate entity pairs uses more information about matching entity pairs, the improvement of The quality of the candidate entity pairs.

附图说明Description of drawings

图1是本发明基于迭代匹配的大型异构知识库对齐方法的流程框图;Fig. 1 is the flow diagram of the method for aligning large-scale heterogeneous knowledge bases based on iterative matching in the present invention;

图2是本发明基于迭代匹配的大型异构知识库对齐方法中数据预处理阶段的流程图;Fig. 2 is a flow chart of the data preprocessing stage in the large-scale heterogeneous knowledge base alignment method based on iterative matching in the present invention;

图3是本发明基于迭代匹配的大型异构知识库对齐方法中知识库对齐阶段的流程图。Fig. 3 is a flowchart of the knowledge base alignment stage in the iterative matching-based large-scale heterogeneous knowledge base alignment method of the present invention.

具体实施方式Detailed ways

为了更为具体地描述本发明,下面结合附图及具体实施方式对本发明的技术方案进行详细说明。In order to describe the present invention more specifically, the technical solutions of the present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments.

如图1所示,本发明基于迭代匹配的大型异构知识库对齐方法分为数据预处理和知识库对齐两个部分。数据预处理部分:对原知识库KB中的数据进行筛选、统一数据格式,并获取知识库中的关系和初始匹配实体对;知识库对齐部分:首先利用知识库中的关系对预处理后的知识库分区,并精简区块,然后利用已匹配实体对匹配区块,得到匹配区块对,接着在匹配区块对中挑选候选实体对,并结合相似度度量方法和阈值确认候选实体对,最后重复上述步骤,直到不能发现新的候选实体对,即可得到所有匹配实体对。As shown in FIG. 1 , the method for aligning large-scale heterogeneous knowledge bases based on iterative matching in the present invention is divided into two parts: data preprocessing and knowledge base alignment. Data preprocessing part: filter the data in the original knowledge base KB, unify the data format, and obtain the relationship in the knowledge base and the initial matching entity pair; knowledge base alignment part: first use the relationship in the knowledge base to pair the preprocessed The knowledge base is partitioned, and the blocks are simplified, and then the matched entity pairs are used to match the blocks to obtain the matched block pairs, and then the candidate entity pairs are selected from the matched block pairs, and the candidate entity pairs are confirmed by combining the similarity measurement method and the threshold value, Finally, the above steps are repeated until no new candidate entity pairs can be found, and all matching entity pairs can be obtained.

图2所示的是数据预处理阶段的流程图;根据图2,该阶段分为以下步骤:Figure 2 shows the flow chart of the data preprocessing stage; according to Figure 2, this stage is divided into the following steps:

S1-1,输入任意两个原知识库KB1、KB2,并去除知识库KB1、KB2中与对齐任务无关的信息。S1-1, input any two original knowledge bases KB 1 and KB 2 , and remove information irrelevant to the alignment task in the knowledge bases KB 1 and KB 2 .

知识库定义为六元组(E,L,R,P,FR,FP),其中,E,L,R,P分别表示实体、字面量、关系以及属性的集合;代表实体-关系-实体的三元组集合,表示宾语为实体的关系事实;代表实体-属性-字面量的三元组集合,表示宾语为字面量的属性事实;FR和FP中都存在无意义的信息,例如:某些知识库中包含用于抽取三元组的原文本语料,这些信息会影响算法的效率。另外,某些包含“same As”关系的三元组也应被去除。The knowledge base is defined as a six-tuple (E, L, R, P, F R , F P ), where E, L, R, and P respectively represent a collection of entities, literals, relationships, and attributes; Represents the triplet set of entity-relation-entity, and represents the relationship fact that the object is an entity; Represents the set of triples of entity-attribute-literal, and represents the fact that the object is a literal attribute; there are meaningless information in both FR and FP , for example: some knowledge bases contain triples for extracting The original text corpus, this information will affect the efficiency of the algorithm. In addition, some triples containing "same As" relations should also be removed.

S1-2,对知识库KB1中的字面量L1和知识库KB2中的字面量L2统一数据格式,将日期、数字、姓名表示为统一格式。 S1-2 , unify the data format for the literal quantity L1 in the knowledge base KB1 and the literal quantity L2 in the knowledge base KB2 , and express the date, number, and name in a unified format.

不同知识库中的姓名、日期、数字等字面量的表达方式可能不同,例如:“2016-01-01”和“01.01.2016”。将这些信息统一,利于后续比较,另外,方法将字面量统一成小写。Literal values such as names, dates, and numbers may be expressed in different ways in different knowledge bases, for example: "2016-01-01" and "01.01.2016". Unifying these information is beneficial for subsequent comparisons. In addition, the method unifies literals into lowercase.

S1-3,去除知识库KB1中的字面量L1和知识库KB2中的字面量L2中停用词字符、符号字符、语言标签等无意义字符,得到处理后知识库KB′1和KB′2S1-3, remove meaningless characters such as stop word characters, symbol characters, and language tags in the literal amount L 1 in the knowledge base KB 1 and the literal amount L 2 in the knowledge base KB 2 , and obtain the processed knowledge base KB′ 1 and KB′ 2 .

知识库中对于实体的属性描述中可能会存在一些无意义字符,例如:“the”、“a”和“an”等停用词,“#”、“!”和“*”等符号以及“@en”等语言标签。这些字符影响实体对的相似度度量,因此去除这些字符。There may be some meaningless characters in the attribute description of the entity in the knowledge base, such as stop words such as "the", "a" and "an", symbols such as "#", "!" and "*", and " @en" and other language tags. These characters affect the similarity measure of entity pairs, so these characters are removed.

S1-4,统计获取与知识库KB′1相对的关系集R1、与知识库KB′2相对应的关系集R2S1-4. Obtain statistically the relation set R 1 corresponding to the knowledge base KB′ 1 and the relation set R 2 corresponding to the knowledge base KB′ 2 .

此步骤中,对于知识库KB′1,遍历属于该知识库的三元组集合FR1中的所有三元组(实体-关系-实体),统计得到关系集R1;对于知识库KB′2,遍历属于该知识库的三元组集合FR2中的所有三元组(实体-关系-实体),统计得到关系集R2,关系集R1和关系集R2用于后续的知识库分区操作。In this step, for the knowledge base KB′ 1 , traverse all triples (entity-relationship-entity) in the triple set F R1 belonging to the knowledge base, and obtain the relation set R 1 through statistics; for the knowledge base KB′ 2 , traverse all triples (entity-relationship-entity) in the triplet set F R2 belonging to the knowledge base, and obtain the relation set R 2 , relation set R 1 and relation set R 2 for subsequent knowledge base partitioning operate.

S1-5,比较知识库KB′1与知识库KB′2中的所有实体,获取初始匹配实体对集 S1-5, compare all the entities in the knowledge base KB′ 1 and the knowledge base KB′ 2 , and obtain the initial matching entity pair set

此步骤中,初始匹配实体对集的获取过程为:In this step, the initial matching entity pair set The acquisition process is:

首先,提取知识库KB′1中的所有实体组成实体集E1,提取知识库KB′2中的所有实体组成实体集E2;并以实体集E1中的任一实体与实体集E2中的任一实体的笛卡尔积作为实体对,组成实体对集;First, extract all the entities in the knowledge base KB′ 1 to form the entity set E 1 , extract all the entities in the knowledge base KB’ 2 to form the entity set E 2 ; and use any entity in the entity set E 1 to form the entity set E 2 The Cartesian product of any entity in is used as an entity pair to form an entity pair set;

然后,筛选获取实体对集中两实体姓名属性的字符串表示完全相同的实体对,得到预初始匹配实体对集;Then, filter and obtain the entity pairs whose strings of the two entity name attributes in the entity pair set represent identical entity pairs, and obtain the pre-initial matching entity pair set;

最后,筛选预初始匹配实体对集中具有一对一匹配关系的实体对,作为初始匹配实体对集 Finally, filter the entity pairs with one-to-one matching relationship in the pre-initial matching entity pair set as the initial matching entity pair set

图3所示的是知识库对齐阶段的流程图;根据图3,该阶段分为以下步骤:Figure 3 shows the flowchart of the knowledge base alignment phase; according to Figure 3, this phase is divided into the following steps:

S2-1,输入知识库KB′1、知识库KB′2、关系集R1、关系集R2、初始匹配实体对集设置区块相似度阈值δb为0.2、实体相似度阈值δe为0.65、区块内实体数量阈值δ1为50以及区块内已匹配实体比率阈值δ2为0.3,匹配实体对集Me初始化为初始匹配实体对集 S2-1, input knowledge base KB′ 1 , knowledge base KB′ 2 , relation set R 1 , relation set R 2 , initial matching entity pair set Set the block similarity threshold δ b to 0.2, the entity similarity threshold δ e to 0.65, the threshold of the number of entities in the block δ 1 to 50, and the threshold of the ratio of matched entities in the block δ 2 to 0.3, the matching entity pair set M e initialized to the initial set of matching entity pairs

S2-2,随机选取关系集R1或关系集R2中的任一关系,利用该关系将知识库KB′1和知识库KB′2中的实体分成若干个区块,得到与知识库KB′1相对应的区块集B1、与知识库KB′2相对应的区块集B2S2-2. Randomly select any relation in relation set R 1 or relation set R 2 , use this relation to divide entities in knowledge base KB′ 1 and knowledge base KB′ 2 into several blocks, and obtain The block set B 1 corresponding to ′ 1 and the block set B 2 corresponding to the knowledge base KB′ 2 .

此步骤中,利用关系将知识库中的实体分成若干个区块的具体过程为;In this step, the specific process of dividing the entities in the knowledge base into several blocks by using the relationship is as follows;

首先,对于知识库KR′1中的三元组集合FR1,统计得到三元组集合FR1中n种宾语实体;First, for the triplet set F R1 in the knowledge base KR′ 1 , get n types of object entities in the triplet set F R1 through statistics;

然后,对于每种宾语实体,将三元组集合FR1中与其相对应的所有主语实体放在一起,得到1个区块,n种宾语实体得到n个区块,组成区块集B1Then, for each object entity, put together all the subject entities corresponding to it in the triple set FR1 to obtain 1 block, and n types of object entities to obtain n blocks to form block set B 1 .

利用同样的方法得到区块集B2,即:Use the same method to get the block set B 2 , namely:

首先,对于知识库KB′2中的三元组集合FR2,统计得到三元组集合FR2中n种宾语实体;First, for the triplet set F R2 in the knowledge base KB′ 2 , get n types of object entities in the triplet set F R2 through statistics;

然后,对于每种宾语实体,将三元组集合FR2中与其相对应的所有主语实体放在一起,得到1个区块,n种宾语实体得到n个区块,组成区块集B2Then, for each object entity, put together all the subject entities corresponding to it in the triple set FR2 to obtain 1 block, and n types of object entities to obtain n blocks to form block set B 2 .

S2-3,去除区块集B1和区块集B2中易产生高计算量或难以生成匹配实体对的区块,得到精简区块集B′1和精简区块集B′2S2-3, remove the blocks in the block set B 1 and the block set B 2 that are prone to high computational load or difficult to generate matching entity pairs, and obtain the simplified block set B' 1 and the simplified block set B' 2 .

此步骤中,易产生高计算量或难以生成匹配实体对的区块包括:实体数量超过阈值δ1的区块、已匹配实体比率小于阈值δ2的区块以及实体都已经匹配的区块。In this step, the blocks that tend to generate high computational load or are difficult to generate matching entity pairs include: blocks with the number of entities exceeding the threshold δ1, blocks with the ratio of matched entities less than the threshold δ2 , and blocks with entities that have already been matched.

S2-4,利用匹配实体对集Me中的所有匹配实体对度量精简区块集B′1中任一区块与精简区块集B′2中任一区块之间的相似度,选择相似度值大于区块相似度阈值δb的两个区块进行匹配,得到匹配区块对集。S2-4, using all matching entity pairs in the matching entity pair set M e to measure the similarity between any block in the reduced block set B′ 1 and any block in the reduced block set B′ 2 , select Two blocks whose similarity value is greater than the block similarity threshold δ b are matched to obtain a matching block pair set.

将每个区块看是实体的集合,已经匹配的实体对看作是两集合间的相同元素,利用集合相似度来度量区块间的相似度,相似度simblock(bk,bl)的计算公式为:Each block is regarded as a collection of entities, and the matched entity pairs are regarded as the same elements between the two collections, and the similarity between blocks is measured by using the similarity of the collection, similarity sim block (b k , b l ) The calculation formula is:

其中,bk和bl表示两区块,|bk∩bl|表示两个区块内匹配实体对数量,|bk∪bl|表示两个区块内总实体数量。Among them, b k and b l represent two blocks, |b k ∩ b l | represents the number of matching entity pairs in the two blocks, and |b k ∪ b l | represents the total number of entities in the two blocks.

S2-5,对属于匹配区块对集中的任一匹配区块对,以该匹配区块对的一个区块中的任一未匹配实体与该匹配区块对的另一个区块中的任一未匹配实体的笛卡尔积作为候选实体对,组成候选实体对集。S2-5, for any matching block pair belonging to the matching block pair set, any unmatched entity in one block of the matching block pair and any unmatched entity in the other block of the matching block pair The Cartesian product of an unmatched entity is used as a candidate entity pair to form a candidate entity pair set.

S2-6,判断是否未发现新候选实体对,若否,跳转执行S2-7,若是,结束迭代,输出匹配实体对集MeS2-6, judging whether no new candidate entity pair is found, if not, skip to S2-7, if so, end the iteration, and output the matching entity pair set M e .

S2-7,计算候选实体对集中每个候选实体对中两实体之间的相似度,将相似度值大于实体相似度阈值δe对应的候选实体对添加至匹配实体对集Me中,剩下的候选实体对舍弃。S2-7. Calculate the similarity between two entities in each candidate entity pair in the candidate entity pair set, and add the candidate entity pair whose similarity value is greater than the entity similarity threshold δ e to the matching entity pair set M e , and the remaining The following candidate entity pairs are discarded.

此步骤中,实体之间的相似度通过2种方式进行度量:字符串相似度和区块相似度,并以一定的权重组合这两种相似度,其公式如下:In this step, the similarity between entities is measured in two ways: string similarity and block similarity, and these two similarities are combined with a certain weight. The formula is as follows:

sim(ei,ej)=αsimstring(ei,ej)+(1-α)simblock(bk,bl)sim(e i ,e j )=αsim string (e i ,e j )+(1-α)sim block (b k ,b l )

s.t.ei∈bk,ej∈bl ste i ∈ b k , e j ∈ b l

其中,simstring(ei,ej)和simblock(bk,bl)分别表示实体间的字符串相似度和区块相似度,bk和bl分别表示实体ei和ej所在的区块,α是字符串相似度的权重,取值为0.6。针对实体ei和ej共有的属性对(例如:姓名),字符串相似度度量这些属性值的相似度。方法使用了多种相似度度量函数,如:基于Levenshtein距离、基于Jaro-Winker距离、基于q-gram及基于I-SUB的相似度函数,并通过线性加权的方式组合这些相似度度量函数。区块相似度通过实体所在区块间的相似度来表示实体的相似度。得到实体间的相似度之后,结合阈值δe判断该对实体是否匹配,并将新发现的匹配实体对加入所有匹配实体对。Among them, sim string (e i , e j ) and sim block (b k , b l ) represent the string similarity and block similarity between entities respectively, and b k and b l represent the locations of entities e i and e j respectively block, α is the weight of string similarity, with a value of 0.6. For the attribute pairs shared by entities e i and e j (for example: name), the string similarity measures the similarity of these attribute values. The method uses a variety of similarity measurement functions, such as: based on Levenshtein distance, based on Jaro-Winker distance, based on q-gram and based on I-SUB, and combined these similarity measurement functions by linear weighting. Block similarity represents the similarity of entities through the similarity between the blocks where entities are located. After obtaining the similarity between entities, combine the threshold δ e to judge whether the pair of entities match, and add the newly discovered matching entity pair to all matching entity pairs.

S2-8,判断迭代次数是否小于迭代阈值,都否,跳转执行S2-2;若是,结束迭代,输出匹配实体对集MeS2-8, judging whether the number of iterations is less than the iteration threshold, if not, skip to S2-2; if so, end the iteration, and output the matching entity pair set M e .

以上所述的具体实施方式对本发明的技术方案和有益效果进行了详细说明,应理解的是以上所述仅为本发明的最优选实施例,并不用于限制本发明,凡在本发明的原则范围内所做的任何修改、补充和等同替换等,均应包含在本发明的保护范围之内。The above-mentioned specific embodiments have described the technical solutions and beneficial effects of the present invention in detail. It should be understood that the above-mentioned are only the most preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, supplements and equivalent replacements made within the scope shall be included in the protection scope of the present invention.

Claims (10)

1.一种基于迭代匹配的大型异构知识库对齐方法,具体包括:1. A large-scale heterogeneous knowledge base alignment method based on iterative matching, including: 数据预处理阶段:对任意两个原知识库KB1、KB2中的数据进行筛选、统一数据格式以及剔除无意义字符处理,并统计获取与处理后知识库KB′1相对应的关系集R1、与处理后知识库KB′2相对应的关系集R2,比较获取初始匹配实体对集 Data preprocessing stage: filter the data in any two original knowledge bases KB 1 and KB 2 , unify the data format and eliminate meaningless characters, and obtain the relation set R corresponding to the processed knowledge base KB′ 1 1. The relationship set R 2 corresponding to the processed knowledge base KB′ 2 is compared to obtain the initial matching entity pair set 知识库对齐阶段:利用关系集R1与关系集R2中的关系对知识库KB′1和知识库KB′2进行分区,并对每个区块进行精简,得到精简区块集B′1和B′2;然后,利用初始匹配实体对集匹配精简区块集B′1和B′2中的区块,得到匹配区块对,最后,在匹配区块对中挑选候选实体对,并结合相似度度量方法和阈值确认候选实体对。Knowledge base alignment stage: use the relationship in relation set R 1 and relation set R 2 to partition knowledge base KB′ 1 and knowledge base KB′ 2 , and simplify each block to obtain a simplified block set B′ 1 and B′ 2 ; then, use the initial matching entity pair set Match the blocks in the reduced block sets B' 1 and B' 2 to obtain matching block pairs. Finally, select candidate entity pairs from the matching block pairs, and combine similarity measurement methods and thresholds to confirm candidate entity pairs. 2.如权利要求1所述的基于迭代匹配的大型异构知识库对齐方法,其特征在于,所述的数据预处理阶段的具体步骤为:2. the large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 1, is characterized in that, the specific steps of described data preprocessing stage are: (1-1)输入任意两个原知识库KB1、KB2,并去除知识库KB1、KB2中与对齐任务无关的信息;(1-1) Input any two original knowledge bases KB 1 and KB 2 , and remove the information irrelevant to the alignment task in the knowledge bases KB 1 and KB 2 ; (1-2)对知识库KB1中的字面量L1和知识库KB2中的字面量L2统一数据格式,将日期、数字、姓名表示为统一格式;(1-2) unify the data format to the literal quantity L 1 in the knowledge base KB 1 and the literal quantity L 2 in the knowledge base KB 2 , date, numeral, full name are represented as unified format; (1-3)去除知识库KB1中的字面量L1和知识库KB2中的字面量L2中停用词字符、符号字符、语言标签字符,得到处理后知识库KB′1和KB′2(1-3) Remove the stop word characters, symbol characters, and language label characters in the literal quantity L 1 in the knowledge base KB 1 and the literal quantity L 2 in the knowledge base KB 2 , and obtain the processed knowledge base KB′ 1 and KB '2; (1-4)统计获取与知识库KB′1相对应的关系集R1、与知识库KB′2相对应的关系集R2(1-4) statistically obtain the relation set R 1 corresponding to the knowledge base KB′ 1 and the relation set R 2 corresponding to the knowledge base KB′ 2 ; (1-5)比较知识库KB′1与知识库KB′2中的所有实体,获取初始匹配实体对集 (1-5) Compare all entities in knowledge base KB′ 1 and knowledge base KB′ 2 , and obtain the initial matching entity pair set 3.如权利要求2所述的基于迭代匹配的大型异构知识库对齐方法,其特征在于,所述步骤(1-4)的具体过程为:3. the large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 2, is characterized in that, the specific process of described step (1-4) is: 对于知识库KB′1,遍历属于该知识库的三元组集合FR1中的所有的实体-关系-实体三元组,统计得到关系集R1;对于知识库KB′2,遍历属于该知识库的三元组集合FR2中的所有的实体-关系-实体三元组,统计得到关系集R2For the knowledge base KB′ 1 , traverse all the entity-relationship-entity triples in the triple set F R1 belonging to the knowledge base, and obtain the relation set R 1 through statistics; for the knowledge base KB′ 2 , traverse all the triples belonging to the knowledge All the entity-relationship-entity triplets in the triplet set F R2 of the library are counted to obtain the relational set R 2 . 4.如权利要求2所述的基于迭代匹配的大型异构知识库对齐方法,其特征在于,步骤(1-5)中,所述的初始匹配实体对集的获取过程为:4. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 2, characterized in that, in step (1-5), the initial matching entity pair set The acquisition process is: 首先,提取知识库KB′1中的所有实体组成实体集E1,提取知识库KB′2中的所有实体组成实体集E2;并以实体集E1中的任一实体与实体集E2中的任一实体的笛卡尔积作为实体对,组成实体对集;First, extract all the entities in the knowledge base KB′ 1 to form the entity set E 1 , extract all the entities in the knowledge base KB’ 2 to form the entity set E 2 ; and use any entity in the entity set E 1 to form the entity set E 2 The Cartesian product of any entity in is used as an entity pair to form an entity pair set; 然后,筛选获取实体对集中两实体姓名属性的字符串表示完全相同的实体对,得到预初始匹配实体对集;Then, filter and obtain the entity pairs whose strings of the two entity name attributes in the entity pair set represent identical entity pairs, and obtain the pre-initial matching entity pair set; 最后,筛选预初始匹配实体对集中具有一对一匹配关系的实体对,作为初始匹配实体对集 Finally, filter the entity pairs with one-to-one matching relationship in the pre-initial matching entity pair set as the initial matching entity pair set 5.如权利要求1所述的基于迭代匹配的大型异构知识库对齐方法,其特征在于,所述的知识库对齐阶段的具体步骤为:5. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 1, wherein the specific steps of the knowledge base alignment stage are: (2-1)输入知识库KB′1、知识库KB′2、关系集R1、关系集R2、初始匹配实体对集设置区块相似度阈值δb、实体相似度阈值δe、区块内实体数量阈值δ1以及区块内已匹配实体比率阈值δ2,匹配实体对集Me初始化为初始匹配实体对集 (2-1) Input knowledge base KB′ 1 , knowledge base KB′ 2 , relation set R 1 , relation set R 2 , initial matching entity pair set Set the block similarity threshold δ b , the entity similarity threshold δ e , the threshold of the number of entities in the block δ 1 , and the threshold of the ratio of matched entities in the block δ 2 , the matching entity pair set M e is initialized as the initial matching entity pair set (2-2)随机选取关系集R1或关系集R2中的任一关系,利用该关系将知识库KB′1和知识库KB′2中的实体分成若干个区块,得到与知识库KB′1相对应的区块集B1、与知识库KB′2相对应的区块集B2(2-2) Randomly select any relation in relation set R 1 or relation set R 2 , use this relation to divide entities in knowledge base KB′ 1 and knowledge base KB′ 2 into several blocks, and obtain Block set B 1 corresponding to KB′ 1 , block set B 2 corresponding to knowledge base KB′ 2 ; (2-3)去除区块集B1和区块集B2中易产生高计算量或难以生成匹配实体对的区块,得到精简区块集B′1和精简区块集B′2(2-3) Remove block sets B 1 and block sets B 2 that are prone to generate high computational load or blocks that are difficult to generate matching entity pairs, and obtain a simplified block set B' 1 and a simplified block set B'2; (2-4)利用匹配实体对集Me中的所有匹配实体对度量精简区块集B′1中任一区块与精简区块集B′2中任一区块之间的相似度,选择相似度值大于区块相似度阈值δb的两个区块进行匹配,得到匹配区块对集;( 2-4 ) Use all matching entity pairs in the matching entity pair set Me to measure the similarity between any block in the reduced block set B'1 and any block in the reduced block set B'2 , Select two blocks whose similarity value is greater than the block similarity threshold δ b to match, and obtain a matching block pair set; (2-5)对属于匹配区块对集中的任一匹配区块对,以该匹配区块对的一个区块中的任一未匹配实体与该匹配区块对的另一个区块中的任一未匹配实体的笛卡尔积作为候选实体对,组成候选实体对集;(2-5) For any matching block pair belonging to the matching block pair set, any unmatched entity in one block of the matching block pair and any unmatched entity in the other block of the matching block pair The Cartesian product of any unmatched entity is used as a candidate entity pair to form a candidate entity pair set; (2-6)判断是否未发现新候选实体对,若否,跳转执行步骤(2-7),若是,结束迭代,输出匹配实体对集Me(2-6) Judging whether no new candidate entity pair has been found, if not, jump to step (2-7), if so, end the iteration, and output the matching entity pair set M e ; (2-7)计算候选实体对集中每个候选实体对中两实体之间的相似度,将相似度值大于实体相似度阈值δe对应的候选实体对添加至匹配实体对集Me中,剩下的候选实体对舍弃;(2-7) Calculate the similarity between two entities in each candidate entity pair in the candidate entity pair set, and add the candidate entity pair whose similarity value is greater than the entity similarity threshold δ e to the matching entity pair set M e , The remaining candidate entity pairs are discarded; (2-8)判断迭代次数是否小于迭代阈值,都否,跳转执行步骤(2-2);若是,结束迭代,输出匹配实体对集Me(2-8) Determine whether the number of iterations is less than the iteration threshold, if not, skip to step (2-2); if yes, end the iteration, and output the matching entity pair set M e . 6.如权利要求5所述的基于迭代匹配的大型异构知识库对齐方法,其特征在于,步骤(2-2)中,所述的利用关系将知识库中的实体分成若干个区块的具体过程为;6. the large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, is characterized in that, in step (2-2), described utilization relation divides the entity in the knowledge base into several blocks The specific process is; 首先,对于知识库KB′1中的三元组集合FR1,统计得到三元组集合FR1中n种宾语实体;First, for the triplet set F R1 in the knowledge base KB′ 1 , get n types of object entities in the triplet set F R1 through statistics; 然后,对于每种宾语实体,将三元组集合FR1中与其相对应的所有主语实体放在一起,得到1个区块,n种宾语实体得到n个区块,组成区块集B1Then, for each object entity, put together all the subject entities corresponding to it in the triple set FR1 to obtain 1 block, and n types of object entities to obtain n blocks to form a block set B 1 ; 利用同样的方法得到区块集B2Use the same method to get the block set B 2 . 7.如权利要求5所述的基于迭代匹配的大型异构知识库对齐方法,其特征在于,步骤(2-3)中,所述的易产生高计算量或难以生成匹配实体对的区块包括:实体数量超过阈值δ1的区块、已匹配实体比率小于阈值δ2的区块以及实体都已经匹配的区块。7. The large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, characterized in that, in step (2-3), the blocks that are prone to generate high computational load or are difficult to generate matching entity pairs Including: blocks with the number of entities exceeding the threshold δ 1 , blocks with the ratio of matched entities less than the threshold δ 2 , and blocks with entities that have already been matched. 8.如权利要求5所述的基于迭代匹配的大型异构知识库对齐方法,其特征在于,步骤(2-4)中,区块间的相似度的获取方法为:8. the large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 5, is characterized in that, in step (2-4), the acquisition method of the similarity between blocks is: 将每个区块看是实体的集合,已经匹配的实体对看作是两集合间的相同元素,利用集合相似度来度量区块间的相似度,相似度simblock(bk,bl)的计算公式为:Each block is regarded as a collection of entities, and the matched entity pairs are regarded as the same elements between the two collections, and the similarity between blocks is measured by using the similarity of the collection, similarity sim block (b k , b l ) The calculation formula is: 其中,bk和bl表示两区块,|bk∩bl|表示两个区块内匹配实体对数量,|bk∪bl|表示两个区块内总实体数量。Among them, b k and b l represent two blocks, |b k ∩ b l | represents the number of matching entity pairs in the two blocks, and |b k ∪ b l | represents the total number of entities in the two blocks. 9.如权利要求8所述的基于迭代匹配的大型异构知识库对齐方法,其特征在于,步骤(2-7)中,实体之间的相似度的获取公式为:9. the large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 8, is characterized in that, in step (2-7), the acquisition formula of the similarity between entities is: sim(ei,ej)=αsimstring(ei,ej)+(1-α)simblock(bk,bl)sim(e i ,e j )=αsim string (e i ,e j )+(1-α)sim block (b k ,b l ) s.t.ei∈bk,ej∈bl ste i ∈ b k , e j ∈ b l 其中,bk和bl分别表示实体ei和ej所在的区块,simstring(ei,ej)和simblock(bk,bl)分别表示实体间的字符串相似度和区块相似度,α是字符串相似度的权重,取值范围为[0,1]。Among them, b k and b l represent the blocks where entities e i and e j are located respectively, and sim string (e i , e j ) and sim block (b k , b l ) represent the string similarity and block Block similarity, α is the weight of string similarity, and the value range is [0,1]. 10.如权利要求9所述的基于迭代匹配的大型异构知识库对齐方法,其特征在于,采用基于Levenshtein距离、基于Jaro-Winker距离、基于q-gram及基于I-SUB的相似度函数,并通过线性加权的方式组合这些相似度度量函数计算获得字符串相似度。10. the large-scale heterogeneous knowledge base alignment method based on iterative matching as claimed in claim 9, is characterized in that, adopts the similarity function based on Levenshtein distance, based on Jaro-Winker distance, based on q-gram and based on I-SUB, And the string similarity is obtained by combining these similarity measurement functions in a linear weighted manner.
CN201710237034.6A 2017-04-12 2017-04-12 Alignment method for large heterogeneous knowledge bases based on iterative matching Expired - Fee Related CN107145523B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710237034.6A CN107145523B (en) 2017-04-12 2017-04-12 Alignment method for large heterogeneous knowledge bases based on iterative matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710237034.6A CN107145523B (en) 2017-04-12 2017-04-12 Alignment method for large heterogeneous knowledge bases based on iterative matching

Publications (2)

Publication Number Publication Date
CN107145523A CN107145523A (en) 2017-09-08
CN107145523B true CN107145523B (en) 2019-10-18

Family

ID=59774786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710237034.6A Expired - Fee Related CN107145523B (en) 2017-04-12 2017-04-12 Alignment method for large heterogeneous knowledge bases based on iterative matching

Country Status (1)

Country Link
CN (1) CN107145523B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460021B (en) * 2018-03-16 2021-10-12 安徽大学 Method for extracting problem method pairs in thesis title
CN109492114A (en) * 2018-11-16 2019-03-19 南京茂毓通软件科技有限公司 A kind of entity information recognition methods
CN109739939A (en) * 2018-12-29 2019-05-10 颖投信息科技(上海)有限公司 The data fusion method and device of knowledge mapping
CN110413704B (en) * 2019-06-27 2022-05-03 浙江大学 Entity Alignment Method Based on Weighted Neighborhood Information Coding
CN111191045B (en) * 2019-12-30 2023-06-16 创新奇智(上海)科技有限公司 Entity alignment method and system applied to knowledge graph
CN112699667B (en) * 2020-12-29 2024-05-21 京东科技控股股份有限公司 Entity similarity determination method, device, equipment and storage medium
CN112784609A (en) * 2021-03-16 2021-05-11 云知声智能科技股份有限公司 Method, apparatus, device and medium for determining whether medical record includes consultation opinions
CN113609304B (en) * 2021-07-20 2023-05-23 广州大学 Entity matching method and device
CN114417810B (en) * 2021-12-29 2024-07-09 东方财富信息股份有限公司 SimBlock algorithm for realizing high-quality text similarity calculation and realization method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679492A (en) * 2013-11-29 2015-06-03 国际商业机器公司 Computer-implemented technical support providing device and method
CN104899242A (en) * 2015-03-10 2015-09-09 四川大学 Mechanical product design two-dimensional knowledge pushing method based on design intent

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9411878B2 (en) * 2014-02-19 2016-08-09 International Business Machines Corporation NLP duration and duration range comparison methodology using similarity weighting

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104679492A (en) * 2013-11-29 2015-06-03 国际商业机器公司 Computer-implemented technical support providing device and method
CN104899242A (en) * 2015-03-10 2015-09-09 四川大学 Mechanical product design two-dimensional knowledge pushing method based on design intent

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Interference Alignment with Cyclic Unidirectional;Mohammad Reza Nakhai et al;《2012 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC)》;20121129;第1-5页 *
中文异构百科知识库实体对齐;黄峻福等;《计算机应用》;20160710;第1881-1886,1898页 *

Also Published As

Publication number Publication date
CN107145523A (en) 2017-09-08

Similar Documents

Publication Publication Date Title
CN107145523B (en) Alignment method for large heterogeneous knowledge bases based on iterative matching
WO2021083239A1 (en) Graph data query method and apparatus, and device and storage medium
CN104035949B (en) Similarity data retrieval method based on locality sensitive hashing (LASH) improved algorithm
CN102768670B (en) Webpage clustering method based on node property label propagation
CN108763376B (en) A Knowledge Representation Learning Method Integrating Relation Path, Type, and Entity Description Information
CN110457486A (en) Method and device for human entity alignment based on knowledge graph
CN105893382A (en) Priori knowledge based microblog user group division method
CN113761221B (en) Knowledge graph entity alignment method based on graph neural network
CN104391881A (en) Word segmentation algorithm-based log parsing method and word segmentation algorithm-based log parsing system
CN105956158B (en) Method for Automatic Extraction of Network New Words Based on Massive Microblog Texts and User Information
CN104866471A (en) Instance matching method based on local sensitive Hash strategy
CN108520035A (en) Query Processing Method of SPARQL Basic Graph Pattern Based on Star Decomposition
Bouhamoum et al. Scaling up schema discovery for RDF datasets
CN112528067A (en) Graph database storage method, graph database reading method, graph database storage device, graph database reading device and graph database reading equipment
JP7615431B2 (en) METHOD, APPARATUS, ELECTRONIC DEVICE AND STORAGE MEDIUM FOR TRAINING LANGUAGE MODEL
CN105808262B (en) A kind of name matching process based on json formatted datas
CN104731887B (en) A kind of user method for measuring similarity in collaborative filtering
US20230056760A1 (en) Method and apparatus for processing graph data, device, storage medium, and program product
CN112905906B (en) A recommendation method and system integrating local collaboration and feature intersection
CN104778202B (en) The analysis method and system of event evolutionary process based on keyword
CN106383863A (en) Isomorphic sub-graph query optimization method
CN106682107B (en) Method and device for determining incidence relation of database table
CN104391952A (en) File system index establishing method and file system query implementing method
CN115129871A (en) Text category determination method and device, computer equipment and storage medium
CN107544962A (en) Social media text query extended method based on Similar Text feedback

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20191018