CN113392182A - Knowledge matching method, device, equipment and medium fusing context semantic constraints

Info

Publication number: CN113392182A
Application number: CN202110509526.2A
Authority: CN
Inventors: 王永斌; 张忠平; 肖益珊; 刘廉如; 季文翀; 曾汉; 温振山; 黄永; 郑涛
Original assignee: Yitong Century Internet Of Things Research Institute Guangzhou Co ltd
Current assignee: Yitong Century Internet Of Things Research Institute Guangzhou Co ltd
Priority date: 2021-05-11
Filing date: 2021-05-11
Publication date: 2021-09-14
Anticipated expiration: 2041-05-11
Also published as: CN113392182B

Abstract

The invention discloses a knowledge matching method, apparatus, device and medium integrating context semantic constraints. The method includes: performing knowledge extraction on an acquired input sentence to obtain an entity vector and a relation vector of the input sentence; converting the input sentence into a text vector; splicing the entity vector, the relation vector and the text vector to obtain a target vector; performing similarity matching between the target vector and the vector records in a knowledge base to determine a set of similar vectors; and taking the vector with the highest similarity in the set as the matching result. The invention improves efficiency, reduces the time and labor cost of producing labeled data, and can be widely used in the technical field of natural language processing.

Description

Knowledge matching method, apparatus, device and medium integrating context semantic constraints

Technical Field

The present invention relates to the technical field of natural language processing, and in particular to a knowledge matching method, apparatus, device and medium integrating context semantic constraints.

Background

Artificial intelligence has progressed from computational intelligence to perceptual intelligence and is now moving toward cognitive intelligence. To achieve cognitive intelligence, machines must learn to understand, reason about and explain knowledge as humans do, which is a major problem that artificial intelligence urgently needs to solve. At present, combining deep learning with knowledge graph reasoning can effectively bridge the semantic gap in human natural language. A knowledge graph describes the concepts and entities of the objective world, together with their attributes and relationships, in a structured way, usually as triples of the form [entity 1, relation, entity 2] or [entity, attribute, attribute value]. Knowledge graphs make it easier to organize, manage and understand the massive amount of information on the Internet, and can transform that information into a form closer to the way humans perceive the world. They are therefore widely used in intelligent retrieval, intelligent question answering, personalized recommendation and other fields.

In knowledge-graph-based intelligent retrieval and question answering systems that work on unstructured data, the basic tasks are knowledge graph construction, information extraction (entity extraction, relation extraction, attribute extraction), entity linking, query construction and so on. The largest open-source Chinese knowledge graph has reached a scale of 140 million, and knowledge graph construction technology is mature enough that a required knowledge graph can be built quickly. Information extraction (IE) is a text processing technology that extracts factual information such as entities, attributes, relations and events from natural language text; driven by industrial needs and academic competitions, Chinese information extraction has also achieved notable results. However, the complexity of text semantics makes entity linking and query construction difficult. Polysemous words, homographs, demonstrative pronouns and the like make a single entity name ambiguous, so that it may correspond to several distinct entities. For example, without context the entity "apple" can refer to "apple (fruit)" or "Apple (technology company)". How to distinguish such cases and link to the correct entity is the main research problem of entity linking and querying.

Existing entity linking methods mainly comprise two steps: entity mention recognition and entity disambiguation. Mention recognition first requires a mention-entity dictionary, built for example by extracting the titles of entity pages, disambiguation pages and redirect pages of an encyclopedia as entity mentions. Building this dictionary costs time and resources, and the cost rises sharply with the size of the data. Entity disambiguation mainly uses methods such as learning to rank and graph models, which depend on large amounts of labeled data. In addition, querying the extracted entity-relation pairs with query languages such as SPARQL and Cypher, or matching them by entity-relation similarity, is a common way to link to a knowledge base, but the query may return several possible results. For example, suppose the question "Who starred in The King of Comedy? It was her debut work" is entered into a question answering system. Entity-relation extraction yields [The King of Comedy, starring, ?], where "?" is a placeholder. Querying the knowledge base with the above query languages may return [The King of Comedy, starring, Stephen Chow], [The King of Comedy, starring, Karen Mok], [The King of Comedy, starring, Cecilia Cheung] and other results; similarity matching behaves the same way.
Because the knowledge base usually stores only triple information, the machine cannot use the following context "debut work" as a constraint, and it is therefore difficult to obtain the correct result [The King of Comedy, starring, Cecilia Cheung].
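The ambiguity described above can be illustrated with a minimal Python sketch. The in-memory triple list and the `match_pattern` helper are hypothetical stand-ins for a knowledge base and a SPARQL or Cypher pattern query; they are not part of the patent.

```python
# Toy in-memory triple store (hypothetical data mirroring the example in the text).
TRIPLES = [
    ("The King of Comedy", "starring", "Stephen Chow"),
    ("The King of Comedy", "starring", "Karen Mok"),
    ("The King of Comedy", "starring", "Cecilia Cheung"),
]


def match_pattern(triples, head=None, relation=None, tail=None):
    """Return every triple that agrees with the non-None fields of the pattern."""
    return [
        (h, r, t)
        for (h, r, t) in triples
        if (head is None or h == head)
        and (relation is None or r == relation)
        and (tail is None or t == tail)
    ]


# The extracted query [The King of Comedy, starring, ?] leaves the tail open.
candidates = match_pattern(TRIPLES, head="The King of Comedy", relation="starring")

# Every actor matches; nothing in the stored triples encodes "debut work".
for candidate in candidates:
    print(candidate)
```

A pure triple-pattern lookup has no field in which the "debut work" context could be expressed, which is the gap the method below addresses.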

Summary of the Invention

In view of this, embodiments of the present invention provide a low-cost and efficient knowledge matching method, apparatus, device and medium integrating context semantic constraints.

One aspect of the present invention provides a knowledge matching method integrating context semantic constraints, including:

performing knowledge extraction on an acquired input sentence to obtain an entity vector and a relation vector of the input sentence;

converting the input sentence into a text vector;

splicing the entity vector, the relation vector and the text vector to obtain a target vector;

performing similarity matching between the target vector and the vector records in a knowledge base to determine a set of similar vectors;

taking the vector with the highest similarity in the set of similar vectors as the matching result.

Preferably, the method further includes a step of building the knowledge base, which includes:

training word vectors on the original training text;

learning vector representations of the entities and relations of each training text in the knowledge base to obtain entity training vectors and relation training vectors;

performing word segmentation on the original training text, and representing the segmentation result as a vector using the trained word vectors to obtain a first vector;

splicing the entity training vector, the relation training vector and the first vector to complete the construction of the knowledge base.

Preferably, before the step of training word vectors on the original training text, the method further includes:

preprocessing the original training text, wherein the preprocessing includes stop-word removal and word segmentation;

the stop-word removal includes deleting useless words from the original training text, the useless words including at least one of the following: modal particles, adverbs, prepositions and conjunctions;

the word segmentation includes converting the text into a space-separated word sequence using the jieba word segmentation tool.

Preferably, representing the segmentation result as a vector using the trained word vectors to obtain the first vector includes:

adding the trained word vectors one by one and taking their average to obtain the first vector.

Preferably, splicing the entity training vector, the relation training vector and the first vector to complete the construction of the knowledge base includes:

constructing a first triple from the entities and the relation, and constructing a second triple from the attributes of the entity, the first triple being [first entity, relation, second entity] and the second triple being [entity, attribute, attribute value];

uniformly representing the first triple and the second triple as the first vector.

Preferably, the target vector is expressed as p = (e_{p,1}, r_p, 0, t_p);

where p is the target vector; e_{p,1} is the entity vector; r_p is the relation vector; t_p is the text vector; and 0 is a zero vector of the same dimension as e_{p,1} and r_p, used as the placeholder vector for the second entity.

Preferably, performing similarity matching between the target vector and the vector records in the knowledge base to determine the set of similar vectors includes:

calculating the similarity between the target vector and each vector record in the knowledge base using an improved cosine similarity formula, and determining the set of similar vectors from the calculated similarities;

wherein the improved cosine similarity formula [formula image not preserved in this text] uses the following symbols:

e_{s,1} denotes the first entity vector of the knowledge base; r_s denotes the relation vector of the knowledge base; e_{s,2} denotes the second entity vector of the knowledge base; t_s denotes the text semantic vector of the knowledge base; β denotes the weight;

e_{p,1} denotes the first entity vector to be matched; r_p denotes the relation vector to be matched; t_p denotes the text semantic vector to be matched.
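One form consistent with the verbal definitions given in the detailed description (the text component of each vector is multiplied by β before an ordinary cosine similarity is taken) can be written as follows. The tilde notation and the name sim are assumptions introduced here for illustration; the patent's own typeset formula may differ in notation.

```latex
\tilde{s} = \bigl(e_{s,1},\; r_s,\; e_{s,2},\; \beta\, t_s\bigr), \qquad
\tilde{p} = \bigl(e_{p,1},\; r_p,\; 0,\; \beta\, t_p\bigr)

\operatorname{sim}(s, p)
  = \frac{\tilde{s} \cdot \tilde{p}}{\lVert \tilde{s} \rVert \, \lVert \tilde{p} \rVert}
  = \frac{e_{s,1} \cdot e_{p,1} + r_s \cdot r_p + \beta^{2}\, t_s \cdot t_p}
         {\lVert \tilde{s} \rVert \, \lVert \tilde{p} \rVert}
```

Under this reading the second-entity block contributes nothing to the numerator, because the query carries a zero vector in that position, while β controls how strongly the text similarity weighs against the triple similarity.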

Another aspect of the embodiments of the present invention provides a knowledge matching apparatus integrating context semantic constraints, including:

a knowledge extraction module, configured to perform knowledge extraction on an acquired input sentence to obtain an entity vector and a relation vector of the input sentence;

a conversion module, configured to convert the input sentence into a text vector;

a splicing module, configured to splice the entity vector, the relation vector and the text vector to obtain a target vector;

a similarity matching module, configured to perform similarity matching between the target vector and the vector records in a knowledge base to determine a set of similar vectors;

a determining module, configured to take the vector with the highest similarity in the set of similar vectors as the matching result.

Another aspect of the embodiments of the present invention provides an electronic device, including a processor and a memory;

the memory is configured to store a program;

the processor executes the program to implement the method described above.

Another aspect of the embodiments of the present invention provides a computer-readable storage medium storing a program, the program being executed by a processor to implement the method described above.

An embodiment of the present invention also discloses a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device can read the computer instructions from the computer-readable storage medium and execute them, causing the computer device to perform the method described above.

Embodiments of the present invention perform knowledge extraction on an acquired input sentence to obtain an entity vector and a relation vector of the input sentence; convert the input sentence into a text vector; splice the entity vector, the relation vector and the text vector to obtain a target vector; perform similarity matching between the target vector and the vector records in a knowledge base to determine a set of similar vectors; and take the vector with the highest similarity in the set as the matching result. Because the invention relies on similarity matching, it does not need the additional supporting data that entity linking requires, so the steps of processing and constructing that data are eliminated. This improves efficiency and reduces the time and labor cost of producing labeled data.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the overall steps of an embodiment of the present invention;

FIG. 2 is an example diagram of word vectors provided by an embodiment of the present invention;

FIG. 3 is an example diagram of knowledge matching provided by an embodiment of the present invention.

Detailed Description

To make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application, not to limit it.

To address the problems in the prior art, an embodiment of the present invention provides the knowledge matching method integrating context semantic constraints set out in the summary above.

The specific implementation of the embodiments of the present invention is described in detail below with reference to the drawings:

As shown in FIG. 1, the present invention includes two stages: building the knowledge base and knowledge matching. It should be noted that the present invention presupposes an existing knowledge graph and the text data from which that knowledge graph was built.

(1) Building the knowledge base:

Step 1) Training word vectors: the text data used to build the knowledge graph (hereinafter the source text) is used to train word vectors with a text representation method. The source text must first be preprocessed by stop-word removal and word segmentation. Stop-word removal saves space and improves processing efficiency: using an open-source stop-word list, words that carry no real meaning or semantic information are deleted from the text. These are usually modal particles, adverbs, prepositions, conjunctions and the like, so that only keywords remain. Word segmentation uses a tool such as jieba to convert the text into a space-separated word sequence.

Specifically, the embodiment of the present invention performs word segmentation and cleaning on the text corpus used to build the knowledge base. The corpus comes from the dataset of the information extraction task of the "2019 Language and Intelligence Technology Competition" jointly organized by the China Computer Federation, the Chinese Information Processing Society of China and Baidu. The corpus is first processed and divided into two parts: a triple dataset, and a source text dataset containing the text corresponding to each triple. After stop-word removal, the source text dataset is segmented with a tool such as jieba and fed into the open-source GloVe model to train 300-dimensional word vectors. The resulting word vectors take the form shown in FIG. 2.
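The preprocessing in step 1) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the sentence has already been segmented (in practice by a tool such as jieba) with punctuation removed, and the four-entry `STOP_WORDS` set is a hypothetical stand-in for an open-source stop-word list.

```python
# Hypothetical, tiny stop-word list; a real system would load an open-source list.
STOP_WORDS = {"的", "就是", "了", "是"}


def preprocess(segmented_tokens, stop_words=STOP_WORDS):
    """Drop stop words and return the space-separated keyword sequence."""
    kept = [tok for tok in segmented_tokens if tok and tok not in stop_words]
    return " ".join(kept)


# Tokens as a segmenter might emit them for the sentence
# "Cecilia Cheung's debut work is The King of Comedy, directed by Stephen Chow".
tokens = ["张柏芝", "的", "出道", "作品", "就是", "周星驰", "导演", "的", "喜剧之王"]

keyword_sequence = preprocess(tokens)
print(keyword_sequence)
```

The resulting space-separated sequence is the kind of input a word-vector trainer such as GloVe consumes.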

Step 2) Training the knowledge graph embedding: analogously to word vectors, knowledge graph embedding is used to learn vector representations of the entities and relations in the knowledge base, yielding entity vectors and relation vectors. This embedding does not distinguish relations from attributes, or entity 2 from attribute values; that is, [entity 1, relation, entity 2] and [entity, attribute, attribute value] are uniformly represented as [E1, R, E2].

Specifically, the embodiment of the present invention uses the open-source ComplEx knowledge graph embedding model to convert the triple dataset obtained in step 1) into vectors. For example, the triple [The King of Comedy, starring, Cecilia Cheung] is vectorized as e_{s,1}, r_s, e_{s,2}, where e_{s,1}, r_s and e_{s,2} are each 100-dimensional; the vectors take a form similar to the word vectors.

Step 3) Computing the source text vector: for all words remaining after the source text has been segmented and its stop words removed, the trained word vectors are added one by one and averaged; the resulting vector t represents the whole source text.

Specifically, suppose the source text dataset contains the sentence "Cecilia Cheung's debut work is The King of Comedy, directed by Stephen Chow". Step 1) yields the word sequence "Cecilia Cheung / debut / work / Stephen Chow / director / The King of Comedy". The word vectors of all words in this sequence are summed and averaged to obtain a 300-dimensional vector t_s that represents the source text. At this point, all triples and their corresponding source texts have been vectorized.
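The averaging in step 3) can be sketched in a few lines of plain Python. The 3-dimensional vectors below are made-up values for illustration (the embodiment uses 300 dimensions), and skipping tokens without a trained vector is an assumption, since the text does not say how out-of-vocabulary words are handled.

```python
def average_vectors(tokens, word_vectors):
    """Element-wise mean of the vectors of all tokens present in word_vectors."""
    # Assumption: tokens without a trained vector are skipped.
    found = [word_vectors[tok] for tok in tokens if tok in word_vectors]
    if not found:
        raise ValueError("no token has a trained word vector")
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Hypothetical 3-dimensional word vectors.
word_vectors = {
    "debut": [1.0, 0.0, 2.0],
    "work": [3.0, 2.0, 0.0],
    "director": [2.0, 4.0, 4.0],
}

t_s = average_vectors(["debut", "work", "director"], word_vectors)
print(t_s)  # [2.0, 2.0, 2.0]
```

Averaging keeps the sentence vector in the same space and dimension as the word vectors regardless of sentence length, which is what allows it to be spliced onto fixed-width records later.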

Step 4) Knowledge splicing: the three vectors e_{s,1}, r_s, e_{s,2} of a triple [E1, R, E2] in the knowledge base are spliced together with the vector t_s representing the source text of that triple (obtained in step 3), giving the vector s = (e_{s,1}, r_s, e_{s,2}, t_s). That is, s is formed by concatenating the four vectors e_{s,1}, r_s, e_{s,2}, t_s along the row direction, so it contains both the triple information and the semantic information of the source text. At this point, the whole knowledge base has been built.

Specifically, the embodiment of the present invention splices the triple vectors obtained in step 2) and the source text vector obtained in step 3) along the row direction to form one record of the knowledge base. Each record is 600-dimensional, and the whole knowledge base in the present invention contains 340,000 records, each of the form s = (e_{s,1}, r_s, e_{s,2}, t_s).
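A minimal sketch of how one knowledge-base record could be assembled, using the dimensions stated in the embodiment (three 100-dimensional embeddings plus a 300-dimensional text vector). The `build_record` helper and the constant fill values are hypothetical placeholders, not trained embeddings.

```python
def build_record(e1, r, e2, t):
    """Concatenate the triple embeddings and the text vector into one flat record."""
    if not (len(e1) == len(r) == len(e2)):
        raise ValueError("entity and relation embeddings must share a dimension")
    return list(e1) + list(r) + list(e2) + list(t)


EMB_DIM, TEXT_DIM = 100, 300  # dimensions stated in the embodiment

# Placeholder values standing in for trained embeddings.
e_s1 = [0.1] * EMB_DIM
r_s = [0.2] * EMB_DIM
e_s2 = [0.3] * EMB_DIM
t_s = [0.4] * TEXT_DIM

s = build_record(e_s1, r_s, e_s2, t_s)
print(len(s))  # 600
```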

(2) Knowledge matching stage:

Step 1) Knowledge extraction is performed on the input sentence, that is, the entities and relations in the sentence are extracted. Using the knowledge graph embedding obtained in step 2) of the knowledge base building stage, the entity is converted into the vector e_{p,1} and the relation into the vector r_p. Since knowledge extraction is not the focus of the present invention, the related techniques are not described in detail.

Specifically, the embodiment of the present invention performs knowledge extraction on the question entered into the question answering system; for the extraction model, the entries published for the 2019 Language and Intelligence Technology Competition may be consulted. Suppose the input question is "Who starred in The King of Comedy? The King of Comedy was her debut work". The knowledge extraction model then yields [The King of Comedy, starring, ?], where "?" is a placeholder. Using the knowledge graph embedding trained in step 2) of the knowledge base building stage, the entity "The King of Comedy" and the relation "starring" are vectorized as e_{p,1} and r_p respectively, and "?" is represented by a zero vector of the same dimension.
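The vectorization of an extracted query triple can be sketched as a dictionary lookup with a zero vector for the placeholder. The lookup tables, the 4-dimensional values and the `vectorize_query` helper are hypothetical; in the embodiment the embeddings come from the trained knowledge graph embedding and are 100-dimensional.

```python
def vectorize_query(triple, entity_emb, relation_emb, placeholder="?"):
    """Map (head, relation, tail) to vectors, using zeros for the placeholder tail."""
    head, relation, tail = triple
    e1 = entity_emb[head]
    r = relation_emb[relation]
    # The unknown second entity becomes a zero vector of the same dimension.
    e2 = [0.0] * len(e1) if tail == placeholder else entity_emb[tail]
    return e1, r, e2


# Hypothetical 4-dimensional embeddings.
entity_emb = {"The King of Comedy": [0.5, -0.1, 0.3, 0.9]}
relation_emb = {"starring": [0.2, 0.7, -0.4, 0.1]}

e_p1, r_p, e_p2 = vectorize_query(
    ("The King of Comedy", "starring", "?"), entity_emb, relation_emb
)
print(e_p2)  # [0.0, 0.0, 0.0, 0.0]
```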

Step 2) The input sentence is converted into a text vector t_p, using the method of step 3) of the knowledge base building stage.

Specifically, as in step 3) of knowledge base building, the sentence "Who starred in The King of Comedy? The King of Comedy was her debut work" is processed into "who / starring / The King of Comedy / The King of Comedy / she / debut / work", and the vectorized sentence is t_p.

Step 3) The entity vector e_{p,1} and relation vector r_p obtained in step 1) and the text vector t_p obtained in step 2) are spliced into p = (e_{p,1}, r_p, 0, t_p), where 0 is a zero vector of the same dimension as e_{p,1} and r_p, used as the placeholder vector for entity 2.

Specifically, the spliced vector p = (e_{p,1}, r_p, 0, t_p) in the embodiment of the present invention is formed by concatenating the four vectors e_{p,1}, r_p, 0, t_p along the row direction.
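The layout of the target vector can be made explicit with a short sketch. The helper names `splice_target` and `block_slices` are hypothetical, and the tiny dimensions are chosen only so the result is easy to read; the point is that p has the same block layout as a knowledge-base record, with zeros where the second entity would be.

```python
def splice_target(e_p1, r_p, t_p):
    """Build p = (e_p1, r_p, 0, t_p) with a zero block for the unknown entity."""
    zero = [0.0] * len(e_p1)
    return list(e_p1) + list(r_p) + zero + list(t_p)


def block_slices(emb_dim, text_dim):
    """Index ranges of the four blocks inside a spliced vector."""
    return {
        "e1": slice(0, emb_dim),
        "r": slice(emb_dim, 2 * emb_dim),
        "e2": slice(2 * emb_dim, 3 * emb_dim),
        "t": slice(3 * emb_dim, 3 * emb_dim + text_dim),
    }


# Toy dimensions: 2-dimensional embeddings, 3-dimensional text vector.
p = splice_target([1.0, 2.0], [3.0, 4.0], [5.0, 6.0, 7.0])
slices = block_slices(emb_dim=2, text_dim=3)

print(p)                # [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 5.0, 6.0, 7.0]
print(p[slices["e2"]])  # [0.0, 0.0]
```

Keeping the zero block in place means the query and every record have equal length, so they can be compared component by component without special-casing the missing entity.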

Step 4) Using the improved cosine similarity, the vector obtained in step 3) is matched against the records in the knowledge base. In the improved cosine similarity [formula image not preserved in this text], the knowledge-base operand is obtained from the vector s by multiplying its source text vector t_s by a weight coefficient β, and the query operand is obtained from the vector p by multiplying its text vector t_p by the same weight coefficient β.

Specifically, the embodiment of the present invention matches p against every record s in the knowledge base; the matching result is shown in FIG. 3. The similarity measure is the improved cosine similarity of step 4), in which the text components t_s of s and t_p of p are each multiplied by the weight coefficient β before the cosine is taken.

In the present invention, β = 1.8 is taken as the optimal value of the weight coefficient.
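A minimal sketch of the weighted cosine similarity, under the reading that the last `text_dim` components of each spliced vector are scaled by β before a standard cosine is computed. The function name, the argument layout and the zero-norm fallback are assumptions; the patent's exact formula is not preserved in this text.

```python
import math


def weighted_cosine(s, p, text_dim, beta=1.8):
    """Cosine similarity after scaling the trailing text block of s and p by beta."""
    if len(s) != len(p):
        raise ValueError("vectors must have the same length")
    split = len(s) - text_dim

    def scale(v):
        v = list(v)
        return v[:split] + [beta * x for x in v[split:]]

    s_w, p_w = scale(s), scale(p)
    dot = sum(a * b for a, b in zip(s_w, p_w))
    norm = math.sqrt(sum(a * a for a in s_w)) * math.sqrt(sum(b * b for b in p_w))
    # Assumption: an all-zero vector is treated as having similarity 0.
    return dot / norm if norm else 0.0


# Toy vectors whose last component plays the role of the text block.
print(weighted_cosine([1.0, 0.0, 2.0], [1.0, 0.0, 2.0], text_dim=1))
print(weighted_cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], text_dim=1))
```

With beta set to 1 the function reduces to the ordinary cosine similarity; larger values make agreement in the text block count for more than agreement in the triple block.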

Step 5) The results of step 4) are sorted, and the record with the highest similarity is taken as the final knowledge matching result.
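Steps 4) and 5) together amount to scoring every record and keeping the best one. The sketch below uses made-up 2-dimensional embeddings and 2-dimensional text vectors for two candidate records that share the same first entity and relation; all values and helper names are hypothetical, and the similarity follows the same assumed reading of the improved cosine (text block scaled by β).

```python
import math


def weighted_cosine(s, p, text_dim, beta=1.8):
    """Cosine similarity after scaling the trailing text block of s and p by beta."""
    split = len(s) - text_dim
    s_w = list(s[:split]) + [beta * x for x in s[split:]]
    p_w = list(p[:split]) + [beta * x for x in p[split:]]
    dot = sum(a * b for a, b in zip(s_w, p_w))
    norm = math.sqrt(sum(a * a for a in s_w)) * math.sqrt(sum(b * b for b in p_w))
    return dot / norm if norm else 0.0


def rank_records(query, records, text_dim, beta=1.8):
    """Return (similarity, name) pairs sorted from most to least similar."""
    scored = [(weighted_cosine(vec, query, text_dim, beta), name)
              for name, vec in records.items()]
    return sorted(scored, reverse=True)


# Layout of every vector: e1 (2) + r (2) + e2 (2) + text (2). Values are made up.
records = {
    "Stephen Chow":   [1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0],   # text about directing
    "Cecilia Cheung": [1.0, 0.0, 0.0, 1.0, 1.0, -1.0, 0.0, 1.0],  # text about a debut
}
# Query: same e1 and r, zero placeholder for e2, text vector about a debut.
query = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]

ranking = rank_records(query, records, text_dim=2)
best_similarity, best_name = ranking[0]
print(best_name)
```

Both records agree with the query on the first entity and the relation, so the text block is what separates them; this mirrors how the "debut work" context selects one actor among several in the running example.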

Compared with the prior art, the beneficial effects of the present invention are as follows:

(1) Because similarity matching is used, the additional supporting data required by entity linking is not needed, so the steps of processing and constructing that data are eliminated and the time and labor cost of producing labeled data is reduced. Furthermore, in contrast to query-language-based methods, the knowledge matching result obtained by the present invention is not ambiguous and does not consist of multiple candidate answers.

(2) Unlike existing inventions, the constraint of context information is introduced, so that the output of knowledge matching tends to be unique; the weights of triple similarity and text similarity can also be adjusted dynamically, which gives better matching results.

(3) The present invention can be applied to different fields, differing only in the specific steps of research planning, and is therefore highly general.

An embodiment of the present invention further provides a knowledge matching apparatus integrating context semantic constraints, including:

the memory is configured to store a program;

In some alternative embodiments, the functions or operations noted in the block diagrams may occur out of the order noted in the operational diagrams. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functions or operations involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to give a more complete understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of a larger operation are performed independently.

Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, given the attributes, functions and internal relationships of the various functional modules in the apparatus disclosed herein, the actual implementation of the modules will be within the routine skill of an engineer. Accordingly, those skilled in the art, using ordinary skill, can implement the invention as set forth in the claims without undue experimentation. It should also be understood that the specific concepts disclosed are illustrative only and are not intended to limit the scope of the invention, which is determined by the appended claims and the full scope of their equivalents.

所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读取存储介质中。基于这样的理解，本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质中，包括若干指令用以使得一台计算机设备(可以是个人计算机，服务器，或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括：U盘、移动硬盘、只读存储器(ROM，Read-Only Memory)、随机存取存储器(RAM，Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。The functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention can be embodied in the form of a software product in essence, or the part that contributes to the prior art or the part of the technical solution. The computer software product is stored in a storage medium, including Several instructions are used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes: U disk, mobile hard disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), magnetic disk or optical disk and other media that can store program codes .

The logic and/or steps represented in the flowcharts or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device.

More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that the various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.

In this specification, a description that refers to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

The preferred embodiments of the present invention have been described in detail above, but the invention is not limited to these embodiments. Those skilled in the art may make various equivalent modifications or substitutions without departing from the spirit of the invention, and all such equivalent modifications or substitutions fall within the scope defined by the claims of the present application.

Claims

1. A knowledge matching method fusing context semantic constraints, characterized by comprising:

performing knowledge extraction on an acquired input sentence to obtain an entity vector and a relation vector of the input sentence;

converting the input sentence into a text vector;

splicing the entity vector, the relation vector, and the text vector to obtain a target vector;

performing similarity matching between the target vector and the vector records in a knowledge base to determine a set of similar vectors; and

taking the vector with the highest similarity in the set of similar vectors as the matching result.
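The flow of claim 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names (`cosine`, `splice`, `match`), the `top_k` cut-off used to form the set of similar vectors, and the record keys are all hypothetical, and plain cosine similarity stands in for the similarity measure.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def splice(entity_vec: Sequence[float],
           relation_vec: Sequence[float],
           text_vec: Sequence[float]) -> List[float]:
    """Concatenate the three vectors into one target vector."""
    return [*entity_vec, *relation_vec, *text_vec]


def match(target: Sequence[float],
          knowledge_base: Dict[str, Sequence[float]],
          top_k: int = 3) -> Tuple[Tuple[str, float], List[Tuple[str, float]]]:
    """Score every record, keep the top_k as the similar set, return the best.

    top_k is an assumed way of forming the similar set; a similarity
    threshold would be an equally plausible choice.
    """
    if not knowledge_base:
        raise ValueError("knowledge base is empty")
    scored = [(key, cosine(target, vec)) for key, vec in knowledge_base.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    similar_set = scored[:top_k]
    return similar_set[0], similar_set
```

A call such as `match(splice(e, r, t), kb, top_k=2)` returns the best-scoring record together with the candidate set it was chosen from, which makes the two-stage selection in the claim easy to inspect.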

2. The knowledge matching method fusing context semantic constraints according to claim 1, wherein the method further comprises a step of building the knowledge base, the step comprising:

training word vectors on original training text;

learning vector representations of the entities and relations of each training text in the knowledge base to obtain entity training vectors and relation training vectors;

performing word segmentation on the original training text, and representing the result of the word segmentation as a vector according to the trained word vectors to obtain a first vector; and

splicing the entity training vectors, the relation training vectors, and the first vector to complete construction of the knowledge base.

3. The knowledge matching method fusing context semantic constraints according to claim 2, characterized in that, before the step of training word vectors on the original training text, the method further comprises:

preprocessing the original training text, wherein the preprocessing includes stop-word removal and word segmentation;

the stop-word removal includes: deleting useless words from the original training text, where the useless words include at least one of the following: modal particles, adverbs, prepositions, and conjunctions;

the word segmentation includes: converting the text into a sequence of words separated by spaces by using the jieba word segmentation tool.
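A minimal Python sketch of the preprocessing in claim 3 follows. The claim names the jieba tool for segmentation; to keep the sketch free of third-party dependencies, the tokenizer is passed in as a function and defaults to whitespace splitting, which is an assumption made purely for illustration. The stop-word set `STOP_WORDS` is likewise a hypothetical sample, not a list taken from the patent.

```python
from typing import Callable, FrozenSet, List

# Hypothetical sample of "useless words" (particles, adverbs,
# prepositions, conjunctions); a real list would be much longer.
STOP_WORDS: FrozenSet[str] = frozenset(
    {"的", "了", "吗", "很", "在", "和", "the", "of", "and", "very"}
)


def whitespace_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption); the claim itself names jieba."""
    return text.split()


def preprocess(text: str,
               tokenize: Callable[[str], List[str]] = whitespace_tokenize,
               stop_words: FrozenSet[str] = STOP_WORDS) -> str:
    """Segment the text, drop stop words, and join the rest with spaces."""
    tokens = tokenize(text)
    kept = [tok for tok in tokens if tok.strip() and tok not in stop_words]
    return " ".join(kept)
```

With jieba installed, `preprocess(text, tokenize=jieba.lcut)` would plug in the tool named in the claim, since `jieba.lcut` returns the segmented words as a list. The sketch segments first and filters afterwards, which is one reasonable ordering of the two preprocessing steps.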

4. The knowledge matching method fusing context semantic constraints according to claim 2, wherein representing the result of the word segmentation as a vector according to the trained word vectors to obtain the first vector comprises:

summing the trained word vectors one by one and averaging them to obtain the first vector.
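The averaging step of claim 4 can be illustrated with a short Python sketch. The function name `average_word_vectors`, the dictionary-based lookup table, and the choice to skip out-of-vocabulary words are assumptions made for this example; the claim only states that the word vectors are added one by one and averaged.

```python
from typing import Dict, List, Sequence


def average_word_vectors(tokens: Sequence[str],
                         word_vectors: Dict[str, Sequence[float]],
                         dim: int) -> List[float]:
    """Sum the vectors of the known tokens element-wise, then divide by their count."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are skipped
        if len(vec) != dim:
            raise ValueError(f"vector for {tok!r} has length {len(vec)}, expected {dim}")
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return total  # no known words: fall back to the zero vector
    return [value / count for value in total]
```

Averaging yields a fixed-length text vector regardless of sentence length, which is what allows it to be spliced with the entity and relation vectors.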

5. The knowledge matching method fusing context semantic constraints according to claim 2, wherein splicing the entity training vectors, the relation training vectors, and the first vector to complete construction of the knowledge base comprises:

constructing a first triple according to the entities and the relation, and constructing a second triple according to the attributes of an entity, where the first triple is [first entity, relation, second entity] and the second triple is [entity, attribute, attribute value];

the first triple and the second triple are collectively represented as the first vector.

6. The knowledge matching method fusing context semantic constraints according to claim 1, wherein the target vector takes the form $p = (e_{p,1}, r_p, 0, t_p)$;

where $p$ is the target vector; $e_{p,1}$ is the entity vector; $r_p$ is the relation vector; $t_p$ is the text vector; and $0$ is the placeholder vector for the second entity.
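The target vector layout in claim 6, in which a zero placeholder sits between the relation vector and the text vector, can be sketched in Python as follows. The function name `build_target_vector` and the default placeholder width (equal to the entity vector's length, so that a query lines up with records of the form entity, relation, entity, text) are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def build_target_vector(entity_vec: Sequence[float],
                        relation_vec: Sequence[float],
                        text_vec: Sequence[float],
                        placeholder_dim: Optional[int] = None) -> List[float]:
    """Lay out the target vector as (entity, relation, zeros, text)."""
    if placeholder_dim is None:
        # Assumption: the placeholder is as wide as the entity vector.
        placeholder_dim = len(entity_vec)
    placeholder = [0.0] * placeholder_dim
    return [*entity_vec, *relation_vec, *placeholder, *text_vec]
```

Padding the unknown slot with zeros keeps the query the same length as a knowledge-base record, so the two can be compared position by position.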

7. The knowledge matching method fusing context semantic constraints according to claim 1, wherein performing similarity matching between the target vector and the vector records in the knowledge base to determine the set of similar vectors comprises:

calculating the similarity between the target vector and the vector records in the knowledge base by using an improved cosine similarity formula, and determining the set of similar vectors according to the calculated similarity;

wherein the cosine similarity calculation formula is: [formula not reproduced in the source text]

8. A knowledge matching device fusing context semantic constraints, characterized by comprising:

a knowledge extraction module, configured to perform knowledge extraction on an acquired input sentence to obtain an entity vector and a relation vector of the input sentence;

a conversion module, configured to convert the input sentence into a text vector;

a splicing module, configured to splice the entity vector, the relation vector, and the text vector to obtain a target vector;

a similarity matching module, configured to perform similarity matching between the target vector and the vector records in a knowledge base to determine a set of similar vectors; and

a determining module, configured to take the vector with the highest similarity in the set of similar vectors as the matching result.

9. An electronic device, comprising a processor and a memory;

the memory is configured to store a program;

the processor executes the program to implement the method according to any one of claims 1-7.

10. A computer-readable storage medium, wherein the storage medium stores a program, and the program is executed by a processor to implement the method according to any one of claims 1-7.