CN113157946B - Entity linking method, device, electronic equipment and storage medium

Info

Publication number: CN113157946B
Application number: CN202110529673.6A
Authority: CN
Inventors: 周效军; 李东晓
Original assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Priority date: 2021-05-14
Filing date: 2021-05-14
Publication date: 2024-09-27
Anticipated expiration: 2041-05-14
Also published as: CN113157946A

Abstract

The present invention provides an entity linking method, device, electronic device and storage medium. The entity linking method includes: extracting an entity mention from a text to be analyzed; obtaining a candidate entity set for the entity mention from a knowledge graph, where the candidate entity set includes at least one candidate entity and each candidate entity includes description information; obtaining a first degree of association between the entity mention and the description information of each candidate entity, and obtaining a second degree of association between each candidate entity and the text to be analyzed; and, based on the first degree of association and the second degree of association, obtaining a target entity from the candidate entity set and associating the entity mention with the target entity. The entity linking method of the present invention uses the co-occurrence relationship between words to capture the semantic association between words, so that the entity mention can be accurately linked to the corresponding entity in the knowledge graph, which improves the accuracy and reliability of entity linking and effectively expands the scale of the knowledge graph.

Description

Entity linking method, device, electronic device and storage medium

Technical Field

The present invention relates to the field of computer technology, and in particular to an entity linking method, device, electronic device and storage medium.

Background Art

Constructing a knowledge graph requires entity linking, which associates external entity mentions with the corresponding entities in an existing knowledge graph. Current entity linking techniques compare the context of an entity mention with the description context of candidate entities in the knowledge graph, either semantically or by text structure, compute a similarity, and then decide whether to link. In many cases, however, the context of the entity mention and the description context of a candidate entity share no vocabulary with similar semantics. For example, in the text "Sun Wukong Makes Havoc in Heaven", "Sun Wukong" is an entity mention, and the knowledge graph contains a candidate entity "Monkey Brother" whose description is "Monkey Brother journeys to the West to fetch the scriptures". In this case, neither entity linking based on semantic computation nor entity linking based on text-structure vocabulary comparison can link the entity mention "Sun Wukong" to the candidate entity "Monkey Brother" in the knowledge graph.

Summary of the Invention

The present invention provides an entity linking method, device, electronic device and storage medium, which can improve the accuracy and reliability of entity linking and effectively expand the scale of the knowledge graph.

The present invention provides an entity linking method, comprising:

extracting an entity mention from a text to be analyzed;

obtaining a candidate entity set for the entity mention from a knowledge graph, wherein the candidate entity set includes at least one candidate entity, and the candidate entity includes description information;

obtaining a first degree of association between the entity mention and the description information of each candidate entity, and obtaining a second degree of association between each candidate entity and the text to be analyzed;

based on the first degree of association and the second degree of association, obtaining a target entity from the candidate entity set, and associating the entity mention with the target entity.

According to an entity linking method provided by the present invention, obtaining the first degree of association between the entity mention and the description information of each candidate entity, and obtaining the second degree of association between each candidate entity and the text to be analyzed, comprises:

obtaining the co-occurrence probability of the entity mention and the description information of each candidate entity, and obtaining the co-occurrence probability of each candidate entity and the text to be analyzed;

obtaining the first degree of association between the entity mention and the description information of each candidate entity according to the co-occurrence probability of the entity mention and the description information of each candidate entity, and obtaining the second degree of association between each candidate entity and the text to be analyzed according to the co-occurrence probability of each candidate entity and the text to be analyzed.

According to an entity linking method provided by the present invention, obtaining the co-occurrence probability of the entity mention and the description information of each candidate entity, and obtaining the co-occurrence probability of each candidate entity and the text to be analyzed, comprises:

obtaining the frequency of the two-word groups formed by combining the entity mention with each word in the description information of each candidate entity, and obtaining, based on that frequency, the co-occurrence probability of the entity mention and the description information of each candidate entity;

obtaining the frequency of the two-word groups formed by combining each candidate entity with each word in the text to be analyzed, and obtaining, based on that frequency, the co-occurrence probability of each candidate entity and the text to be analyzed.

According to an entity linking method provided by the present invention, the frequency of the two-word groups is obtained from statistics over the texts in a preset basic corpus.

According to an entity linking method provided by the present invention, obtaining the candidate entity set for the entity mention from the knowledge graph comprises:

obtaining the aliases of the entity mention;

matching, based on the entity mention and the aliases, the candidate entity set for the entity mention from the knowledge graph.

According to an entity linking method provided by the present invention, obtaining the target entity from the candidate entity set based on the first degree of association and the second degree of association comprises:

obtaining a weighted average of the first degree of association and the second degree of association, and using the weighted average as a comprehensive association metric value;

obtaining the target entity based on the comprehensive association metric value and the first degree of association between the entity mention and the description information of each candidate entity.

According to an entity linking method provided by the present invention, obtaining the target entity based on the comprehensive association metric value and the first degree of association between the entity mention and the description information of each candidate entity comprises:

obtaining the first degrees of association that are greater than the comprehensive association metric value;

taking, as the target entity, the candidate entity corresponding to the largest of the first degrees of association that are greater than the comprehensive association metric value.

The present invention also provides an entity linking device, comprising:

an extraction module, configured to extract an entity mention from a text to be analyzed;

a candidate entity acquisition module, configured to obtain a candidate entity set for the entity mention from a knowledge graph, wherein the candidate entity set includes at least one candidate entity, and the candidate entity includes description information;

an association calculation module, configured to obtain a first degree of association between the entity mention and the description information of each candidate entity, and to obtain a second degree of association between each candidate entity and the text to be analyzed;

an entity linking module, configured to obtain a target entity from the candidate entity set based on the first degree of association and the second degree of association, and to associate the entity mention with the target entity.

The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any of the entity linking methods described above.

The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any of the entity linking methods described above.

With the entity linking method, device, electronic device and storage medium provided by the present invention, multiple candidate entities, for example entities related to the name of the entity mention, are first obtained from the knowledge graph. Then, based on big data for example, the co-occurrence of the entity mention with each word in the description information of each candidate entity is determined to obtain the first degree of association, and the co-occurrence of each candidate entity with each word in the text to be analyzed that contains the entity mention is determined to obtain the second degree of association. The closeness of association expressed by the first and second degrees of association captures the relation between the entity mention and the candidate entities, and the entity to be linked is determined from that relation. This ensures that the entity mention can be accurately linked to the corresponding entity in the knowledge graph, improves the accuracy and reliability of entity linking, and effectively expands the scale of the knowledge graph.

Brief Description of the Drawings

In order to illustrate the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of the entity linking method provided by the present invention;

FIG. 2 is a schematic diagram of the entity linking method provided by the present invention;

FIG. 3 is a schematic structural diagram of the entity linking device provided by the present invention;

FIG. 4 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed Description

In order to make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

The entity linking method, device, electronic device and storage medium according to the embodiments of the present invention are described below with reference to the drawings.

Entity linking associates external entity mentions with the corresponding entities in an existing knowledge graph. On the one hand this extends the semantic information of the entity mention, and on the other hand it allows the knowledge graph to be expanded more accurately.

In the above description, an entity mention is entity information extracted from the text to be analyzed, such as the subject of the text. The entity mention is the object to be linked, that is, the object to be associated with the corresponding entity in the existing knowledge graph.

FIG. 1 is a flow chart of an entity linking method according to an embodiment of the present invention. As shown in FIG. 1, the entity linking method according to the embodiment of the present invention includes the following steps:

S101: Extract an entity mention from the text to be analyzed.

As shown in FIG. 2, entity mentions in the text to be analyzed can be extracted by entity recognition. For example, entity mentions are extracted automatically by a pre-prepared named entity recognition (NER) model, which applies a named entity recognition algorithm to the text to be analyzed. The text to be analyzed is preprocessed at character level to produce the input data of the NER model; the NER model then performs named entity sequence labeling automatically, and the entity mentions are extracted from the text to be analyzed according to the predicted label sequence.

For example, if the text to be analyzed is "Zhang San graduated from the Performance Department of the Central Academy of Drama, and is a male actor from Liaoning, China", the entity mentions extracted by the NER model may include "Zhang San", "China", "actor", and so on.

S102: Obtain a candidate entity set for the entity mention from the knowledge graph, where the candidate entity set includes at least one candidate entity, and the candidate entity includes description information.

The knowledge graph includes multiple entities, and the candidate entity set is a set of one or more entities selected from them. The entity with which the entity mention is to be associated is one of the entities in the candidate entity set; therefore, in the embodiments of the present invention, the entities in the candidate entity set are called candidate entities.

As shown in FIG. 2, generating the candidate entities, that is, obtaining the candidate entity set for the entity mention from the knowledge graph, includes: obtaining the aliases of the entity mention; and matching, based on the entity mention and the aliases, the candidate entity set for the entity mention from the knowledge graph. Each candidate entity in the candidate entity set includes description information, which is an entity attribute of the candidate entity and is usually a piece of text. For example, for a text already linked into the knowledge graph, "The plot of Sun Wukong making havoc in heaven is very exciting", the entity in the knowledge graph can be "Sun Wukong" and its description information is "the plot of making havoc in heaven is very exciting". The entity attributes of each candidate entity can also be obtained separately, in which case the candidate entity set corresponds to a candidate entity attribute set.

In one embodiment of the present invention, the candidate entities are generated as a set of names matched from the knowledge graph by regular expression, based on the entity mention and its aliases. The names in this name set correspond to the names of entities in the knowledge graph, that is, each name in the name set corresponds to an entity in the knowledge graph. The specific process is as follows:

A pre-stored alias library is queried with the entity mention. The alias library holds entity mentions and their aliases, so it can extend the set of possible names of the entity mention. An alias may share no Chinese characters with the original name of the entity mention, in which case a regular expression on the original name cannot match it. Therefore the aliases of the entity mention are obtained first, and matching queries are then run separately for the aliases and for the entity mention. For example, the movie entity name "Paradise Cinema" has the alias "Starlight Accompanies My Heart". The alias library records the aliases corresponding to each entity, in a format that includes but is not limited to two-tuples, such as: <Paradise Cinema, Starlight Accompanies My Heart>.
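The alias lookup described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `ALIAS_PAIRS` and `expand_names` are hypothetical, the library is held in memory as a list of two-tuples, and looking a name up in both directions (name to alias and alias to name) is an assumption.

```python
# Hypothetical alias library stored as <name, alias> two-tuples,
# mirroring the example <Paradise Cinema, Starlight Accompanies My Heart>.
ALIAS_PAIRS = [
    ("Paradise Cinema", "Starlight Accompanies My Heart"),
]


def expand_names(mention, alias_pairs=ALIAS_PAIRS):
    """Return the mention followed by every alias recorded for it.

    Assumption: a pair is matched in either direction, so querying with
    an alias also returns the original name.
    """
    names = [mention]
    for name, alias in alias_pairs:
        if mention == name and alias not in names:
            names.append(alias)
        elif mention == alias and name not in names:
            names.append(name)
    return names
```

The returned list can then be passed, name by name, to whatever matching step is used against the knowledge graph.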

After the aliases of the entity mention are obtained, the candidate entity set is obtained from the knowledge graph by applying regular expressions, edit distance or similar methods to the entity mention and its aliases. Regular expression matching includes but is not limited to the following form:

the name of a candidate entity contains the entity mention or one of its aliases, or the entity mention or one of its aliases contains the name of a candidate entity in the knowledge graph.

Edit distance counts the character-level changes needed to transform one character sequence into another. The edit distance between the names of two entities is calculated and compared with a preset threshold, and the entities that pass are selected from the knowledge graph as the candidate entity set.
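The two matching strategies above can be sketched together. This is an illustrative sketch only: `edit_distance` is the standard Levenshtein distance, while `generate_candidates`, the in-memory list of knowledge graph entity names, and the default threshold `max_distance=1` are assumptions made for the example.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions that turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def generate_candidates(names, kg_entity_names, max_distance=1):
    """Keep each knowledge graph entity whose name contains, or is contained
    in, one of the query names, or lies within max_distance edits of one.

    names: the entity mention plus its aliases.
    max_distance: hypothetical threshold; the source only says it is preset.
    """
    candidates = []
    for entity_name in kg_entity_names:
        for name in names:
            contains = (re.search(re.escape(name), entity_name)
                        or re.search(re.escape(entity_name), name))
            close = edit_distance(name, entity_name) <= max_distance
            if contains or close:
                candidates.append(entity_name)
                break
    return candidates
```

For instance, `generate_candidates(["Zhang San"], ["Zhang San (actor)", "Zhang Sam", "Li Si"])` keeps the first name by containment and the second by edit distance, and drops the third.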

S103: Obtain a first degree of association between the entity mention and the description information of each candidate entity, and obtain a second degree of association between each candidate entity and the text to be analyzed.

In one embodiment of the present invention, obtaining the first degree of association between the entity mention and the description information of each candidate entity, and obtaining the second degree of association between each candidate entity and the text to be analyzed, includes: obtaining the co-occurrence probability of the entity mention and the description information of each candidate entity, and obtaining the co-occurrence probability of each candidate entity and the text to be analyzed; obtaining the first degree of association from the co-occurrence probability of the entity mention and the description information of each candidate entity, and obtaining the second degree of association from the co-occurrence probability of each candidate entity and the text to be analyzed.

In this example, obtaining the two co-occurrence probabilities includes: obtaining the frequency of the two-word groups formed by combining the entity mention with each word in the description information of each candidate entity, and obtaining from that frequency the co-occurrence probability of the entity mention and the description information of each candidate entity; and obtaining the frequency of the two-word groups formed by combining each candidate entity with each word in the text to be analyzed, and obtaining from that frequency the co-occurrence probability of each candidate entity and the text to be analyzed.

In the above example, the frequency of the two-word groups is obtained, for example, from statistics over the texts in a preset basic corpus. The basic corpus records multiple texts, which can be collected in advance, for example manually or from the network.

In the above description, the first degree of association expresses how closely the entity mention is associated with the words in the description information of each candidate entity. For example, the closeness can be determined by the probability of co-occurrence in big data: the greater the probability of co-occurrence, the closer the association, and the stronger the relation between the entity mention and the candidate entity. Similarly, the second degree of association expresses how closely each candidate entity is associated with the words in the text to be analyzed.

The first and second degrees of association both depend on the co-occurrence probability, that is, the probability of appearing together, so the co-occurrence probability must be determined first. As shown in FIG. 2, the co-occurrence probability is generated from the basic corpus (that is, big data): the frequency of each two-word combination is counted over the text records in the basic corpus, used as the co-occurrence probability, and stored. The generation process is as follows:

First the text is segmented into words and the number of occurrences of each word is counted. Then each word in turn is taken as the head word, and the number of occurrences of the two-word group formed by the head word and every other word is counted. This yields, for each word in the text record, statistics of the form <word i, word j>: <number of occurrences of word i, number of co-occurrences of word i and word j>. Here co-occurrence means appearing in the same text. The text can first be segmented and passed through entity recognition, and stop words can be removed from the segmentation result; stop words include but are not limited to function words such as particles, adverbs and prepositions. This gives a deduplicated word set and entity set. For a single text, the statistics are therefore usually that each entity occurs once and each entity co-occurs with each word once.

For example, take text S1: "Zhang San graduated from the Performance Department of the Central Academy of Drama, a male actor from Liaoning, China." The operation is divided into step 1 and step 2 below. Step 1 is:

Word segmentation: the segmentation result of S1 is "Zhang San", "graduated", "Central", "Drama", "Academy", "Performance Department", "China", "Liaoning", "native place", "male actor".

Entity recognition: the entity recognition result of S1 is: Zhang San.

Frequency counting: first, for each entity:

Zhang San: 1

Then, for the two-word co-occurrence combinations of the entity with each word:

<Zhang San, graduated>: <1, 1>;

<Zhang San, Central>: <1, 1>;

<Zhang San, Drama>: <1, 1>;

<Zhang San, Academy>: <1, 1>;

<Zhang San, Performance Department>: <1, 1>;

<Zhang San, China>: <1, 1>;

<Zhang San, Liaoning>: <1, 1>;

<Zhang San, native place>: <1, 1>;

<Zhang San, male actor>: <1, 1>.
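Step 1 can be sketched as a small counting routine. This is an illustrative sketch, assuming the text has already been segmented and passed through entity recognition upstream; the function name `count_text` and the placeholder stop-word set are hypothetical.

```python
from collections import Counter

# Hypothetical stop-word set; a real one would list particles, adverbs,
# prepositions and other function words.
STOP_WORDS = {"in", "of", "from"}


def count_text(tokens, entities, stop_words=STOP_WORDS):
    """Per-text statistics for one record.

    tokens: segmented words of the text (segmentation done upstream).
    entities: entity names recognized in the text (NER done upstream).
    Returns (entity_counts, pair_counts). Because words and entities are
    deduplicated, every count for a single text is 1.
    """
    words = {t for t in tokens if t not in stop_words}
    entity_set = set(entities)
    entity_counts = Counter({e: 1 for e in entity_set})
    pair_counts = Counter()
    for e in entity_set:
        for w in words - {e}:
            pair_counts[(e, w)] = 1  # <entity, word> co-occur once in this text
    return entity_counts, pair_counts
```

Called with the ten tokens of S1 and the entity "Zhang San", this yields one entity count and nine pair counts, matching the listing above.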

Step 2: Traverse every text record in the basic corpus. Step 1 is repeated for each text record, and the resulting statistics are merged by adding up the occurrence counts of corresponding words and two-word groups. If an entity statistic or a two-word co-occurrence statistic does not yet exist in the merged result, it is added as a new entry; otherwise its count is added to the existing entry.
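The merge in step 2 can be sketched with `collections.Counter`, whose `update` method adds counts for existing keys and creates entries for new ones. The function name `merge_statistics` and the shape of its input (one `(entity_counts, pair_counts)` tuple per text record) are assumptions made for the example.

```python
from collections import Counter


def merge_statistics(per_text_results):
    """Merge per-text statistics over a corpus.

    per_text_results: iterable of (entity_counts, pair_counts) mappings,
    one per text record. Counts for keys already seen are added together;
    keys not seen before are inserted.
    """
    total_entities, total_pairs = Counter(), Counter()
    for entity_counts, pair_counts in per_text_results:
        total_entities.update(entity_counts)
        total_pairs.update(pair_counts)
    return total_entities, total_pairs
```

For example, merging two records that both contain "Zhang San" but only one of which contains "Liaoning" gives an entity count of 2 and a pair count of 1 for <Zhang San, Liaoning>.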

The two-word group frequency matrix in the statistical results is converted into probability form, where the co-occurrence probability p is the corresponding frequency, that is:

where count() denotes the counted number of occurrences.
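The conversion from counts to probabilities can be sketched as below. The normalization used here, p(word i, word j) = count(word i, word j) / count(word i), is an assumption inferred from the statistics format <word i, word j>: <occurrences of word i, co-occurrences of word i and word j>; the function name is hypothetical.

```python
def cooccurrence_probabilities(head_counts, pair_counts):
    """Convert co-occurrence counts into probabilities.

    Assumed normalization: p(i, j) = count(i, j) / count(i), where i is the
    head word. Pairs whose head word has no recorded count are skipped.
    """
    return {
        (head, word): count / head_counts[head]
        for (head, word), count in pair_counts.items()
        if head_counts.get(head, 0) > 0
    }
```

With `head_counts = {"Zhang San": 4}` and a pair count of 3 for <Zhang San, actor>, the stored probability for that pair is 0.75.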

Calculating the degree of association comprises calculating the first degree of association and the second degree of association, namely: calculating the degree of association between the entity mention and the attribute set of each candidate entity, and calculating the degree of association between each candidate entity and the context of the entity mention (that is, the text to be analyzed with the entity mention removed). The degree of association r is calculated as follows:

where n denotes the number of words after the text is segmented. The first and second degrees of association are obtained in this way, and their weighted average is then used as the comprehensive association metric value of each candidate entity. The co-occurrence probability of each two-word group can be obtained by lookup.

A specific calculation of the degree of association is, for example:

Entity mention: Zhang San. The text to be analyzed, S2, is: In 2011, Zhang San played Hua Rong, "Little Li Guang", in "New Water Margin".

Segmenting S2 gives: Zhang San, New Water Margin, played, Little Li Guang, Hua Rong.

From the entity mention and the segmentation result, the two-word groups are: <Zhang San, New Water Margin>, <Zhang San, played>, <Zhang San, Little Li Guang>, <Zhang San, Hua Rong>.

For these two-word groups, lookup gives the co-occurrence probabilities p1(Zhang San, New Water Margin), p2(Zhang San, played), p3(Zhang San, Little Li Guang) and p4(Zhang San, Hua Rong), and the degree of association is then obtained as:
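As a rough illustration of how the looked-up probabilities might be combined, the sketch below averages them. This is one possible reading, not a formula confirmed by the text: it assumes the degree of association is the arithmetic mean of the co-occurrence probabilities, that n counts the words paired with the entity, and that a pair absent from the lookup table contributes 0. The function name `relevance` and the sample probabilities are hypothetical.

```python
def relevance(entity, words, probabilities):
    """Degree of association between one entity and one segmented text.

    Assumption: the mean of the co-occurrence probabilities of the
    two-word groups <entity, word>; unseen pairs count as 0.
    """
    others = [w for w in words if w != entity]
    if not others:
        return 0.0
    total = sum(probabilities.get((entity, w), 0.0) for w in others)
    return total / len(others)


# Hypothetical probabilities for the S2 example.
S2_WORDS = ["Zhang San", "New Water Margin", "played", "Little Li Guang", "Hua Rong"]
S2_PROBS = {
    ("Zhang San", "New Water Margin"): 0.5,
    ("Zhang San", "played"): 0.25,
    ("Zhang San", "Little Li Guang"): 0.25,
}
```

With these sample values, `relevance("Zhang San", S2_WORDS, S2_PROBS)` averages four pairs, one of which is unseen.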

The above describes how the degree of association between one entity and one piece of text is calculated. In embodiments of the present invention, the comprehensive degree of association R is the weighted average of the degrees of association r1...m between the entity mention and the attribute set of each candidate entity (usually concatenated into a single piece of text) and the degrees of association R1...m between each candidate entity and the context of the entity mention, where m denotes the number of candidate entities.

Here, w1 and w2 are weights, each a value between 0 and 1, and can be set as needed.

S104: Based on the first degree of association and the second degree of association, obtain a target entity from the candidate entity set, and associate the entity mention with the target entity.

In one embodiment of the present invention, obtaining the target entity from the candidate entity set based on the first degree of association and the second degree of association includes: obtaining a weighted average of the first degree of association and the second degree of association, and using the weighted average as a comprehensive association metric; and obtaining the target entity based on the comprehensive association metric and the first degree of association between the entity mention and the description information of each candidate entity.

In this example, obtaining the target entity based on the comprehensive association metric and the first degree of association between the entity mention and the description information of each candidate entity includes: obtaining the first degrees of association that are greater than the comprehensive association metric; and taking the candidate entity corresponding to the largest of those first degrees of association as the target entity. In other words, once the comprehensive association metric for the candidate entity set of the entity mention has been obtained, the largest first degree of association is compared with it. If that value exceeds the comprehensive association metric, the entity mention can be linked, that is, associated with the corresponding entity, which accurately expands the scale of the knowledge graph.
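The selection rule in step S104 can be sketched as follows. This is one plausible reading, not a definitive implementation: it assumes the comprehensive association metric is a single threshold for the whole candidate set, obtained by averaging the first and second degrees over the m candidates and combining the two means with the weights w1 and w2. The function names and sample values are hypothetical.

```python
def comprehensive_metric(first, second, w1=0.5, w2=0.5):
    """Weighted average of the first and second degrees over all candidates.

    Assumption: each list is averaged over the m candidates, then the two
    means are combined with w1 and w2 and normalised by w1 + w2.
    """
    m = len(first)
    if m == 0 or m != len(second):
        raise ValueError("need one first and one second degree per candidate")
    if w1 + w2 == 0:
        raise ValueError("w1 and w2 must not both be zero")
    mean_first = sum(first) / m
    mean_second = sum(second) / m
    return (w1 * mean_first + w2 * mean_second) / (w1 + w2)


def select_target(candidates, first, second, w1=0.5, w2=0.5):
    """Return the candidate with the largest first degree above the metric, or None."""
    threshold = comprehensive_metric(first, second, w1, w2)
    above = [(r, c) for r, c in zip(first, candidates) if r > threshold]
    if not above:
        return None  # no candidate is confident enough to link
    return max(above, key=lambda rc: rc[0])[1]


candidates = ["entity_a", "entity_b", "entity_c"]
first = [0.5, 0.25, 0.0]      # hypothetical r_1..r_3
second = [0.25, 0.25, 0.25]   # hypothetical R_1..R_3
target = select_target(candidates, first, second)
```

Returning `None` when no first degree exceeds the threshold reflects the text's condition that a mention is linked only if its best candidate exceeds the comprehensive association metric.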

In the entity linking method according to embodiments of the present invention, multiple candidate entities, for example those related to the name of the entity mention, are first obtained from the knowledge graph. Then, based on sources such as big data, the co-occurrence of the entity mention with each word in the description information of each candidate entity is determined to obtain the first degree of association, and the co-occurrence of each candidate entity with each word in the text to be analyzed that contains the entity mention is determined to obtain the second degree of association. The closeness expressed by the first and second degrees of association captures the relationship between the entity mention and the candidate entities, and the entity to be linked is determined from that relationship. This ensures that the entity mention is accurately linked to the corresponding entity in the knowledge graph, improves the accuracy and reliability of entity linking, and effectively expands the scale of the knowledge graph.

By way of comparison with existing entity linking techniques, consider a text titled "Sun Wukong Wreaks Havoc in Heaven", in which "Sun Wukong" is an entity mention, and a knowledge graph containing a candidate entity "Monkey Brother" whose description is "Monkey Brother journeys to the West to obtain scriptures". Existing techniques, whether based on semantic computation or on lexical comparison of text structure, cannot link the entity mention "Sun Wukong" to the candidate entity "Monkey Brother", although the two can in fact be linked. The entity linking method of the embodiments of the present invention, by contrast, can identify the association between "Sun Wukong" and "journeys to the West to obtain scriptures" and, likewise, the association between "Monkey Brother" and "Wreaks Havoc in Heaven", and can therefore ensure that the entity mention is accurately linked to the corresponding entity in the knowledge graph.

The entity linking device provided by the present invention is described below. The entity linking device described below and the entity linking method described above may be read with reference to each other.

As shown in FIG. 3, an entity linking device according to an embodiment of the present invention includes an extraction module 310, a candidate entity acquisition module 320, an association degree calculation module 330 and an entity linking module 340, wherein:

the extraction module 310 is configured to extract an entity mention from a text to be analyzed;

the candidate entity acquisition module 320 is configured to obtain a candidate entity set for the entity mention from a knowledge graph, the candidate entity set including at least one candidate entity, and each candidate entity including description information;

the association degree calculation module 330 is configured to obtain a first degree of association between the entity mention and the description information of each candidate entity, and to obtain a second degree of association between each candidate entity and the text to be analyzed; and

the entity linking module 340 is configured to obtain a target entity from the candidate entity set based on the first degree of association and the second degree of association, and to associate the entity mention with the target entity.
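The division into four modules can be illustrated with a small composition sketch. It only shows how the modules might be wired together. The class name, field names, callable signatures and the stub lambdas are all hypothetical, and the stubs return fixed values rather than computing real degrees of association.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class EntityLinkingDevice:
    """Wires four callables together; each field stands in for one module."""
    extract: Callable[[str], List[str]]                          # extraction module 310
    get_candidates: Callable[[str], Dict[str, str]]              # module 320: name -> description
    score: Callable[[str, str, str, str], Tuple[float, float]]   # module 330: (first, second)
    choose: Callable[[List[str], List[float], List[float]], Optional[str]]  # module 340

    def link(self, text: str) -> Dict[str, Optional[str]]:
        links: Dict[str, Optional[str]] = {}
        for mention in self.extract(text):
            cands = self.get_candidates(mention)
            names = list(cands)
            degrees = [self.score(mention, n, cands[n], text) for n in names]
            first = [d[0] for d in degrees]
            second = [d[1] for d in degrees]
            links[mention] = self.choose(names, first, second) if names else None
        return links


# Stub modules with fixed outputs, for illustration only.
device = EntityLinkingDevice(
    extract=lambda text: ["Sun Wukong"] if "Sun Wukong" in text else [],
    get_candidates=lambda mention: {"Monkey Brother": "Monkey Brother journeys to the West"},
    score=lambda mention, name, description, text: (0.5, 0.25),
    choose=lambda names, first, second: names[first.index(max(first))],
)
result = device.link("Sun Wukong wreaks havoc in Heaven")
```

Passing the modules in as callables keeps each stage replaceable, which matches the later statement that the modules need not be physically separate units.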

In the entity linking device according to embodiments of the present invention, multiple candidate entities, for example those related to the name of the entity mention, are first obtained from the knowledge graph. Then, based on sources such as big data, the co-occurrence of the entity mention with each word in the description information of each candidate entity is determined to obtain the first degree of association, and the co-occurrence of each candidate entity with each word in the text to be analyzed that contains the entity mention is determined to obtain the second degree of association. The closeness expressed by the first and second degrees of association captures the relationship between the entity mention and the candidate entities, and the entity to be linked is determined from that relationship. This ensures that the entity mention is accurately linked to the corresponding entity in the knowledge graph, improves the accuracy and reliability of entity linking, and effectively expands the scale of the knowledge graph.

FIG. 4 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 4, the electronic device may include a processor 410, a communications interface 420, a memory 430 and a communication bus 440, where the processor 410, the communications interface 420 and the memory 430 communicate with one another over the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to execute the entity linking method, which includes: extracting an entity mention from a text to be analyzed; obtaining a candidate entity set for the entity mention from a knowledge graph, the candidate entity set including at least one candidate entity, and each candidate entity including description information; obtaining a first degree of association between the entity mention and the description information of each candidate entity, and obtaining a second degree of association between each candidate entity and the text to be analyzed; and, based on the first degree of association and the second degree of association, obtaining a target entity from the candidate entity set and associating the entity mention with the target entity.

In addition, the logic instructions in the memory 430 may be implemented as software functional units and, when sold or used as a standalone product, may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can perform the entity linking method provided above, which includes: extracting an entity mention from a text to be analyzed; obtaining a candidate entity set for the entity mention from a knowledge graph, the candidate entity set including at least one candidate entity, and each candidate entity including description information; obtaining a first degree of association between the entity mention and the description information of each candidate entity, and obtaining a second degree of association between each candidate entity and the text to be analyzed; and, based on the first degree of association and the second degree of association, obtaining a target entity from the candidate entity set and associating the entity mention with the target entity.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the entity linking method provided above. The method includes: extracting an entity mention from a text to be analyzed; obtaining a candidate entity set for the entity mention from a knowledge graph, the candidate entity set including at least one candidate entity, and each candidate entity including description information; obtaining a first degree of association between the entity mention and the description information of each candidate entity, and obtaining a second degree of association between each candidate entity and the text to be analyzed; and, based on the first degree of association and the second degree of association, obtaining a target entity from the candidate entity set and associating the entity mention with the target entity.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement the embodiments without creative effort.

From the description of the above embodiments, a person skilled in the art can clearly understand that each embodiment may be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware. On this understanding, the above technical solution, in essence, or the part of it that contributes over the prior art, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An entity linking method, comprising:

extracting an entity mention from a text to be analyzed;

acquiring a candidate entity set for the entity mention from a knowledge graph, wherein the candidate entity set comprises at least one candidate entity, and each candidate entity comprises description information;

acquiring a first degree of association between the entity mention and the description information of each candidate entity, and acquiring a second degree of association between each candidate entity and the text to be analyzed; and

obtaining a target entity from the candidate entity set based on the first degree of association and the second degree of association, and associating the entity mention with the target entity;

wherein the acquiring of the first degree of association between the entity mention and the description information of each candidate entity, and the acquiring of the second degree of association between each candidate entity and the text to be analyzed, comprise:

acquiring a co-occurrence probability of the entity mention and the description information of each candidate entity, and acquiring a co-occurrence probability of each candidate entity and the text to be analyzed; and

obtaining the first degree of association between the entity mention and the description information of each candidate entity according to the co-occurrence probability of the entity mention and the description information of each candidate entity, and obtaining the second degree of association between each candidate entity and the text to be analyzed according to the co-occurrence probability of each candidate entity and the text to be analyzed.

2. The entity linking method according to claim 1, wherein the acquiring of the co-occurrence probability of the entity mention and the description information of each candidate entity, and the acquiring of the co-occurrence probability of each candidate entity and the text to be analyzed, comprise:

acquiring the frequency of each two-word group formed by combining the entity mention with each word in the description information of each candidate entity, and obtaining, based on that frequency, the co-occurrence probability of the entity mention and the description information of each candidate entity; and

acquiring the frequency of each two-word group formed by combining each candidate entity with each word in the text to be analyzed, and obtaining, based on that frequency, the co-occurrence probability of each candidate entity and the text to be analyzed.

3. The entity linking method according to claim 2, wherein the frequency of the two-word groups is obtained from text statistics over a predetermined base corpus.

4. The entity linking method according to claim 1, wherein the acquiring of the candidate entity set for the entity mention from the knowledge graph comprises:

acquiring alternative names of the entity mention; and

matching against the knowledge graph, based on the entity mention and the alternative names, to obtain the candidate entity set for the entity mention.

5. The entity linking method according to any one of claims 1 to 4, wherein the obtaining of the target entity from the candidate entity set based on the first degree of association and the second degree of association comprises:

acquiring a weighted average of the first degree of association and the second degree of association, and taking the weighted average as a comprehensive association metric; and

obtaining the target entity based on the comprehensive association metric and the first degree of association between the entity mention and the description information of each candidate entity.

6. The entity linking method according to claim 5, wherein the obtaining of the target entity based on the comprehensive association metric and the first degree of association between the entity mention and the description information of each candidate entity comprises:

acquiring the first degrees of association that are greater than the comprehensive association metric; and

taking, as the target entity, the candidate entity corresponding to the largest of the first degrees of association that are greater than the comprehensive association metric.

7. An entity linking apparatus, comprising:

an extraction module, configured to extract an entity mention from a text to be analyzed;

a candidate entity acquisition module, configured to acquire a candidate entity set for the entity mention from a knowledge graph, wherein the candidate entity set comprises at least one candidate entity, and each candidate entity comprises description information;

an association degree calculation module, configured to acquire a first degree of association between the entity mention and the description information of each candidate entity, and to acquire a second degree of association between each candidate entity and the text to be analyzed; and

an entity linking module, configured to obtain a target entity from the candidate entity set based on the first degree of association and the second degree of association, and to associate the entity mention with the target entity.

8. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the entity linking method according to any one of claims 1 to 6.

9. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the entity linking method according to any one of claims 1 to 6.