CN116910189A - Atlas question-answering method, apparatus, system and medium - Google Patents
- Publication number
- CN116910189A (application CN202310217398.3A)
- Authority
- CN
- China
- Prior art keywords
- vector
- question
- candidate
- similarity
- vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a graph question answering method, apparatus, system and medium. The method includes: calculating the similarity between a question vector and candidate vectors to determine a target vector among the candidate vectors, where the question vector is obtained by vector conversion of a user question, the candidate vectors are the set of vectors obtained by vector conversion of a cut triple set, and the similarity between the target vector and the question vector is greater than the similarity between any other candidate vector and the question vector; and answering the user question according to the triple corresponding to the target vector. Because the similarity is computed against cut triples, the answer part of a triple interferes less with matching, which improves the accuracy with which the graph question answering system answers user questions.
Description
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a graph question answering method, apparatus, system and medium.
Background
Intelligent question answering usually relies on methods based on semantic parsing (Semantic Parser, SP) and methods based on information retrieval (Information Retrieval, IR).
However, semantic-parsing methods depend on predefined rule templates and require large amounts of manually annotated data, so their question answering accuracy is low when no training data is available. Information-retrieval methods require syntactic analysis and entity extraction, and existing pre-trained entity extraction models only support a fixed set of entity types, such as person names, place names, venues and dates, so they extract poorly from small-sample vertical-domain knowledge.
It can be seen that graph question answering systems in the prior art suffer from low question answering accuracy.
Summary of the invention
Embodiments of the present invention provide a graph question answering method, apparatus, system and medium to solve the problem of low question answering accuracy in the prior art.
In a first aspect, an embodiment of the present invention provides a graph question answering method, the method including:
calculating the similarity between a question vector and candidate vectors to determine a target vector among the candidate vectors, where the question vector is a vector obtained by vector conversion of a user question, the candidate vectors are the set of vectors obtained by vector conversion of a cut triple set, and the similarity between the target vector and the question vector is greater than the similarity between any other candidate vector and the question vector;
answering the user question according to the triple corresponding to the target vector.
Optionally, before calculating the similarity between the question vector and the candidate vectors, the method further includes:
constructing a knowledge graph based on user graph data to obtain a triple set, where each triple in the triple set includes three elements;
cutting each triple in the triple set into three pairs (2-tuples) to obtain the cut triple set, where each pair in the cut triple set includes two elements;
performing vector conversion on the cut triple set to obtain the candidate vectors.
Optionally, performing vector conversion on the cut triple set to obtain the candidate vectors includes:
performing vector conversion on the cut triple set and then clustering the resulting vectors to obtain index vectors, where each index vector is the cluster center of the candidate vectors it covers, and the number of candidate vectors is greater than the number of index vectors;
and calculating the similarity between the question vector and the candidate vectors to determine the target vector among the candidate vectors includes:
calculating the similarity between the question vector and the index vectors to determine a target index vector among the index vectors, where the similarity between the target index vector and the question vector is greater than the similarity between any other index vector and the question vector;
among the candidate vectors covered by the target index vector, calculating the similarity between the question vector and those candidate vectors to determine the target vector.
Optionally, after performing vector conversion on the cut triple set to obtain the candidate vectors and before calculating the similarity between the question vector and the candidate vectors, the method further includes:
storing the candidate vectors in advance locally or in a database, where the database is associated with a computing device, and the computing device is used to calculate the similarity between the question vector and the candidate vectors.
Optionally, before calculating the similarity between the question vector and the candidate vectors, the method further includes:
training a vector conversion model, where the vector conversion model is used to perform vector conversion on the user question to obtain the question vector, and to perform vector conversion on the cut triple set to obtain the candidate vectors;
where the vector conversion model is the pre-trained language model RoFormer-Sim, and the normalization algorithm in RoFormer-Sim is changed to a root mean square (RMS) algorithm.
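The text does not spell out the RMS normalization. A minimal sketch of one common formulation (dividing each activation by the root mean square of the vector, then applying a per-dimension gain) is given below. The function name `rms_norm`, the `gain` parameter and the `eps` stabilizer are assumptions for illustration and are not taken from the patent or from RoFormer-Sim.

```python
import math

def rms_norm(x, gain=None, eps=1e-8):
    """Scale vector x by its root mean square (hypothetical helper).

    Unlike layer normalization, no mean is subtracted; only the
    magnitude is normalized.
    """
    # Root mean square of the activations; eps guards against division by zero.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if gain is None:
        gain = [1.0] * len(x)  # identity gain when none is supplied
    return [g * v / rms for g, v in zip(gain, x)]

normalized = rms_norm([3.0, 4.0])
```

After normalization with the identity gain, the mean of the squared components is approximately 1.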
In a second aspect, an embodiment of the present invention provides a graph question answering apparatus, including:
a calculation module, configured to calculate the similarity between a question vector and candidate vectors to determine a target vector among the candidate vectors, where the question vector is a vector obtained by vector conversion of a user question, the candidate vectors are the set of vectors obtained by vector conversion of a cut triple set, and the similarity between the target vector and the question vector is greater than the similarity between any other candidate vector and the question vector;
an answering module, configured to answer the user question according to the triple corresponding to the target vector.
Optionally, the apparatus further includes:
a construction module, configured to construct a knowledge graph based on user graph data to obtain a triple set, where each triple in the triple set includes three elements;
a cutting module, configured to cut each triple in the triple set into three pairs to obtain the cut triple set, where each pair in the cut triple set includes two elements;
a conversion module, configured to perform vector conversion on the cut triple set to obtain the candidate vectors.
Optionally, the conversion module includes:
a clustering submodule, configured to perform vector conversion on the cut triple set and then cluster the resulting vectors to obtain index vectors, where each index vector is the cluster center of the candidate vectors it covers, and the number of candidate vectors is greater than the number of index vectors;
and the calculation module includes:
a first calculation submodule, configured to calculate the similarity between the question vector and the index vectors to determine a target index vector among the index vectors, where the similarity between the target index vector and the question vector is greater than the similarity between any other index vector and the question vector;
a second calculation submodule, configured to calculate, among the candidate vectors covered by the target index vector, the similarity between the question vector and those candidate vectors to determine the target vector.
In a third aspect, an embodiment of the present invention provides a graph question answering system, including:
at least one processor; and
a memory communicatively connected to the at least one processor; where
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform the method described in the first aspect.
In the embodiments of the present invention, the similarity is calculated between the question vector, obtained by vector conversion of the user question, and the candidate vectors, obtained by vector conversion of the cut triple set, to determine the target vector. Either the part missing from the cut triple corresponding to the target vector, or the spliced sentence of the triple corresponding to the target vector, is then output as the answer to the user question. Before the similarity is calculated, each triple in the triple set is cut to obtain the cut triple set. As a result, the answer part of a triple interferes less with the similarity calculation, which improves the accuracy with which the graph question answering system answers user questions.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is the first flow chart of the graph question answering method provided by an embodiment of the present invention;
Figure 2 is the second flow chart of the graph question answering method provided by an embodiment of the present invention;
Figure 3 is a structural diagram of the graph question answering apparatus provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to Figure 1, which is the first flow chart of the graph question answering method provided by an embodiment of the present invention, the method includes the following steps:
Step 101: Calculate the similarity between a question vector and candidate vectors to determine a target vector among the candidate vectors, where the question vector is a vector obtained by vector conversion of a user question, the candidate vectors are the set of vectors obtained by vector conversion of a cut triple set, and the similarity between the target vector and the question vector is greater than the similarity between any other candidate vector and the question vector;
A triple set includes multiple triples. A triple may be knowledge in the knowledge graph of the form <head node-relation-tail node> or <node-attribute-attribute value>, and a knowledge triple can be combined into the form of a sentence. The following uses <head node-relation-tail node> as an example; it should be understood that knowledge in forms such as <node-attribute-attribute value> achieves the same technical effect, which is not repeated here.
In one example, the head node and the tail node may be entities, that is, <head node-relation-tail node> can be expressed as <entity-relation-entity>, for example <Zhang San-wife-Li Si>. Because question answering generally interacts through sentences, the elements of a triple can be separated by three vertical bars (for example, "Zhang San|||wife|||Li Si"), joined with other symbols or with connecting words such as "的" ("'s") and "是" ("is") (for example, "Zhang San's wife is Li Si"), or concatenated directly (for example, "张三妻子李四", that is, "Zhang San wife Li Si" with no separator). In each case the sentence obtained from the triple contains the full content of the triple.
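The three splicing options can be sketched as follows. The function name `splice_triple` and the `style` labels are hypothetical names chosen for this illustration; the Chinese strings are the example values from the text.

```python
def splice_triple(triple, style="bars"):
    """Join a (head, relation, tail) triple into one sentence string."""
    head, relation, tail = triple
    if style == "bars":
        # Separator form: head|||relation|||tail
        return "|||".join((head, relation, tail))
    if style == "connective":
        # Connective form using "的" and "是", as in the example sentence
        return f"{head}的{relation}是{tail}"
    if style == "plain":
        # Direct concatenation with no separator
        return head + relation + tail
    raise ValueError(f"unknown style: {style}")

triple = ("张三", "妻子", "李四")  # <Zhang San-wife-Li Si>
sentences = [splice_triple(triple, s) for s in ("bars", "connective", "plain")]
```

All three outputs keep every element of the triple, which is the property the paragraph above relies on.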
In single-hop graph question answering, a user question generally contains only the head node and the relation and asks for the tail node; or contains only the relation and the tail node and asks for the head node; or contains only the head node and the tail node and asks for the relation. For example, the user question may be "Who is Zhang San's wife?" or "What is the relationship between Zhang San and Li Si?". An existing graph question answering system considers "Who is the child of Zhang San and Li Si?" more similar than "Who is Zhang San's wife?" to "Zhang San|||wife|||Li Si", the spliced sentence of the triple <Zhang San-wife-Li Si>. Because the question "Who is the child of Zhang San and Li Si?" mentions both entities "Zhang San" and "Li Si", its similarity to "Zhang San|||wife|||Li Si" is higher and it hits the triple <Zhang San-wife-Li Si> more easily, which lowers the accuracy of existing graph question answering systems. In fact, the spliced sentence "Zhang San|||wife|||Li Si" should be more similar to the question "Who is Zhang San's wife?". Therefore, the present invention cuts each triple in the triple set to obtain a cut triple set, reducing the interference of the answer part of a triple with question answering.
The similarity between the spliced sentence of a triple and the user question can be computed from the distance between vectors. The user question is converted into a question vector, and the cut triple set is converted into candidate vectors. The similarity between the question vector and each candidate vector is calculated to find the target vector, the candidate vector at the shortest distance from the question vector, that is, the candidate most similar to the question vector. The user question thereby hits the triple corresponding to the target vector, and step 102 then answers the user question.
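The selection of the target vector can be sketched as an argmax over similarities. The text does not name a specific similarity measure, so cosine similarity is assumed here; `cosine_similarity` and `find_target` are hypothetical helper names, and the toy vectors stand in for real model embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def find_target(question_vec, candidate_vecs):
    """Return the index of the candidate most similar to the question."""
    return max(
        range(len(candidate_vecs)),
        key=lambda i: cosine_similarity(question_vec, candidate_vecs[i]),
    )

candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
target_index = find_target([0.9, 0.1], candidates)
```

This exhaustive scan compares the question against every candidate, so its cost grows linearly with the number of candidate vectors.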
Step 102: Answer the user question according to the triple corresponding to the target vector.
The candidate vectors are the set of vectors obtained by vector conversion of the cut triple set. In other words, the target vector determined among the candidate vectors is one of the vectors obtained from the cut triple set, so the target vector identifies its corresponding cut triple. The part missing from that cut triple can be output as the answer to the user question, or the spliced sentence of the triple corresponding to the target vector can be output as the answer. The graph question answering method provided by the present invention can be applied in a graph question answering system to improve the accuracy with which it answers user questions.
In one example, the triple <Zhang San-wife-Li Si> can be cut into <Zhang San-wife>, <wife-Li Si> and <Zhang San-Li Si>, and the user question may be "Who is Zhang San's wife?". Based on the similarity between the question vector and the candidate vectors, the cut triple corresponding to the determined target vector may then be <Zhang San-wife>. The part missing from the cut triple <Zhang San-wife> is "Li Si", so "Li Si" can be output as the answer to the question "Who is Zhang San's wife?". Alternatively, the spliced sentence "Zhang San's wife is Li Si" of the corresponding triple <Zhang San-wife-Li Si> can be output as the answer. Either way, question answering accuracy improves.
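The two answer forms in this example can be sketched as follows, assuming the matched pair and its source triple are already known. The names `missing_element` and `answer` are hypothetical, and the English sentence template is only an illustrative rendering of the spliced sentence.

```python
def missing_element(triple, pair):
    """Return the element of `triple` that does not appear in `pair`."""
    remaining = list(triple)
    for item in pair:
        remaining.remove(item)  # drop one occurrence of each matched element
    return remaining[0]

def answer(triple, matched_pair, full_sentence=False):
    """Answer with the missing element, or with the spliced full triple."""
    if full_sentence:
        head, relation, tail = triple
        return f"{head}'s {relation} is {tail}"
    return missing_element(triple, matched_pair)

triple = ("Zhang San", "wife", "Li Si")
short_answer = answer(triple, ("Zhang San", "wife"))
long_answer = answer(triple, ("Zhang San", "wife"), full_sentence=True)
```

Removing one occurrence per matched element, rather than filtering by value, keeps the result correct even when the head and tail happen to be the same string.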
In the embodiments of the present invention, the similarity is calculated between the question vector, obtained by vector conversion of the user question, and the candidate vectors, obtained by vector conversion of the cut triple set, to determine the target vector. Either the part missing from the cut triple corresponding to the target vector, or the spliced sentence of the triple corresponding to the target vector, is then output as the answer to the user question. Before the similarity is calculated, each triple in the triple set is cut to obtain the cut triple set. As a result, the answer part of a triple interferes less with the similarity calculation, which improves the accuracy with which the graph question answering system answers user questions.
In some optional implementations, before calculating the similarity between the question vector and the candidate vectors in step 101, the method further includes:
constructing a knowledge graph based on user graph data to obtain a triple set, where each triple in the triple set includes three elements;
cutting each triple in the triple set into three pairs (2-tuples) to obtain the cut triple set, where each pair in the cut triple set includes two elements;
performing vector conversion on the cut triple set to obtain the candidate vectors.
The user graph data may be structured graph data provided by the user, for example a spreadsheet (Excel) file containing node data (including the attribute information of all nodes) and relation data (including relation attribute information). The data extracted from the Excel file can be stored in a graph database (Neo4j) for subsequent use.
For small and medium-sized domain graphs without large amounts of annotated data, general-purpose models perform poorly at entity recognition and entity linking; the correct content is already lost when candidate relations or candidate subgraphs are retrieved, which degrades overall question answering. Therefore, for knowledge graphs in small-sample domains, the present invention constructs a knowledge graph from user graph data to obtain a triple set. The triple set may include multiple triples, each being knowledge in the knowledge graph with three elements, in the form <head node-relation-tail node> or <node-attribute-attribute value>. Each triple in the triple set is cut into three pairs, and the relevance between the candidate pairs and the question is computed. This avoids the entity recognition step, and because the triples are cut into pairs, the answer entity in a triple no longer interferes with the similarity calculation, which improves question answering accuracy.
In this embodiment, as shown in Figure 2, the triple set may include n triples, for example <entity-relation-entity>1 ... <entity-relation-entity>i ... <entity-relation-entity>n. Each triple in the triple set is cut into three pairs. For example, cutting the triple <entity-relation-entity>1 yields three two-element pairs: <entity-relation>1, <relation-entity>1 and <entity-entity>1. Similarly, cutting <entity-relation-entity>i yields <entity-relation>i, <relation-entity>i and <entity-entity>i, and cutting <entity-relation-entity>n yields <entity-relation>n, <relation-entity>n and <entity-entity>n. The cut triple set may therefore include: <entity-relation>1, <relation-entity>1 and <entity-entity>1 ... <entity-relation>i, <relation-entity>i and <entity-entity>i ... <entity-relation>n, <relation-entity>n and <entity-entity>n. Here i and n are both positive integers, and n is greater than i.
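The construction of the cut triple set from n triples can be sketched as below. The function name `cut_triple_set` and the choice to store each pair together with the index of its source triple are assumptions made for this illustration; the second triple is invented sample data.

```python
def cut_triple_set(triples):
    """Cut every (head, relation, tail) triple into three pairs.

    Each entry keeps the index of its source triple so the full
    triple can be looked up again once a pair has been matched.
    """
    cut = []
    for i, (head, relation, tail) in enumerate(triples):
        cut.append(((head, relation), i))  # <entity-relation>i
        cut.append(((relation, tail), i))  # <relation-entity>i
        cut.append(((head, tail), i))      # <entity-entity>i
    return cut

triples = [
    ("Zhang San", "wife", "Li Si"),
    ("Zhang San", "employer", "Acme"),  # hypothetical second triple
]
cut_set = cut_triple_set(triples)  # 3 * n entries in total
```

The cut set is always three times the size of the original triple set, which is the cost that the index-based search in later embodiments aims to offset.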
Then, vector conversion is performed on the cut triple set to obtain the candidate vectors. For example, <entity-relation>1, <relation-entity>1 and <entity-entity>1 are converted into the candidate vectors <vector1>1, <vector2>1 and <vector3>1; <entity-relation>i, <relation-entity>i and <entity-entity>i are converted into <vector1>i, <vector2>i and <vector3>i; and <entity-relation>n, <relation-entity>n and <entity-entity>n are converted into <vector1>n, <vector2>n and <vector3>n. In this way, when the similarity between the question vector and the candidate vectors is calculated, the answer part of a triple interferes less with matching, which improves the accuracy with which the graph question answering system answers user questions. In practice this scheme works well: for a general-purpose model that has not been fine-tuned on the domain graph, using the split triple combination method to exclude interference from the answer node can raise answer accuracy on a knowledge base question set in the knowledge domain from 60% to above 90%.
Compared with traditional information retrieval approaches, the method proposed by the present invention bypasses the entity recognition step. It thereby avoids two problems: low entity recognition accuracy caused by the large amount of specialized content in domain graphs, and the lack of annotated data in small and medium-sized samples for improving entity recognition accuracy. Compared with directly using all triples as candidate subgraphs and computing their similarity with the user question, this system uses the split triple combination method, which eliminates the problem where the answer part interferes with the similarity calculation so that the triple containing the correct answer does not receive the highest similarity.
In some optional implementations, converting the split triple set into vectors to obtain the candidate vectors includes:
converting the split triple set into vectors and then clustering them to obtain index vectors, where each index vector is the cluster center of the candidate vectors it covers and the number of candidate vectors is greater than the number of index vectors;
and computing the similarity between the question vector and the candidate vectors to determine the target vector among the candidate vectors includes:
computing the similarity between the question vector and the index vectors to determine a target index vector among the index vectors, the similarity between the target index vector and the question vector being greater than the similarity between any other index vector and the question vector; and
computing, among the candidate vectors covered by the target index vector, the similarity between the question vector and the candidate vectors to determine the target vector.
In one example, each triple in the triple set is split into three pairs, and the split triples are converted into candidate vectors. The similarity between the candidate vectors and the question vector can then be computed by comparing the question vector against every candidate vector in turn to determine the target vector. This effectively improves the accuracy of semantic matching, but because each triple is split into three pairs, there are three times as many concatenated sentences behind the candidate vectors as there were before splitting, so three times as many sentences must be compared and the computation takes about three times as long. Moreover, when the knowledge base is large the computation is already slow. In other embodiments, therefore, an index-based computation can be introduced to reduce the time spent.
In another embodiment, after each triple in the triple set has been split into three pairs and the split triples converted into candidate vectors, the similarity computation no longer traverses all candidate vectors but is instead performed against index vectors.
Specifically, the split triple set (for example, <entity-relation>1, <relation-entity>1 and <entity-entity>1 ... <entity-relation>i, <relation-entity>i and <entity-entity>i ... <entity-relation>n, <relation-entity>n and <entity-entity>n) is converted into vectors to obtain the split triple vector set (for example, <vector1>1, <vector2>1 and <vector3>1 ... <vector1>i, <vector2>i and <vector3>i ... <vector1>n, <vector2>n and <vector3>n).
Then all candidate vectors <vector> in the split triple vector set are clustered, and the center of each cluster is taken as the representative index vector for the <vector>s in that cluster; that is, each index vector is the cluster center of the candidate vectors it covers. For example, the resulting index vectors may be <index vector>1 ... <index vector>j ... <index vector>m, where j and m are positive integers and j is less than m. The number of candidate vectors is greater than the number of index vectors, that is, n is greater than m, which speeds up the computation. For instance, <index vector>1 may cover several <vector>s and is the cluster center of those <vector>s.
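The text does not name a clustering algorithm. As one possible reading, the sketch below uses a plain k-means loop with cosine similarity; that choice, the naive initialization from the first m candidates, and all function names are assumptions made for illustration only.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_index(candidates: List[Vector], m: int, rounds: int = 10
                ) -> Tuple[List[Vector], Dict[int, List[int]]]:
    """Group candidates into m clusters (requires 1 <= m <= len(candidates)).

    Returns the m index vectors (cluster centers) and, for each index
    vector, the positions of the candidate vectors it covers.
    """
    centers = [list(v) for v in candidates[:m]]  # naive, deterministic start
    members: Dict[int, List[int]] = {}
    for _ in range(rounds):
        # Assign every candidate to its most similar center.
        members = {j: [] for j in range(m)}
        for i, v in enumerate(candidates):
            best = max(range(m), key=lambda j: cosine(v, centers[j]))
            members[best].append(i)
        # Move each center to the mean of its members (keep it if empty).
        centers = [centroid([candidates[i] for i in members[j]])
                   if members[j] else centers[j] for j in range(m)]
    return centers, members


candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
index_vectors, members = build_index(candidates, m=2)
print(members)  # {0: [0, 2], 1: [1, 3]}
```

A production system would more likely rely on an established clustering or vector-index library; the loop above only shows how index vectors relate to the candidates they cover.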
The number of index vectors can be set to roughly 1,000 to 10,000, and different numbers of clusters can be chosen according to the size of the knowledge base and the results observed in practice. The number of candidate vectors in each cluster and the number of index levels are not limited here; they can be chosen according to the actual situation so that the similarity computation still retrieves the most similar vector accurately, without omissions. If the database is large and there are too many index vectors, the index vectors can themselves be clustered again, that is, a two-level or multi-level index can be set up to reduce the computation time.
Thus, in the similarity computation, the similarity between the question vector and the index vectors is computed first to determine the target index vector, whose similarity to the question vector is greater than that of any other index vector. Once the most similar target index vector has been selected, the similarity between the question vector and the candidate vectors covered by that index vector is computed to determine the target vector. The knowledge combination sentence corresponding to the target vector is then retrieved, and the answer is obtained from it. With the index-based computation, question-answering accuracy is higher than with the traditional method and the time taken is shorter, so both accuracy and efficiency improve.
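The two-stage lookup just described can be sketched in a few lines. The function name `two_stage_search`, the use of cosine similarity, and the toy vectors are assumptions for illustration; the sketch also assumes a single-level index in which every cluster covers at least one candidate.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def two_stage_search(question: Vector,
                     index_vectors: List[Vector],
                     members: Dict[int, List[int]],
                     candidates: List[Vector]) -> int:
    """Return the position of the target vector in `candidates`."""
    # Stage 1: find the index vector most similar to the question.
    target_index = max(range(len(index_vectors)),
                       key=lambda j: cosine(question, index_vectors[j]))
    # Stage 2: compare only against the candidates that index vector covers.
    return max(members[target_index],
               key=lambda i: cosine(question, candidates[i]))


candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
index_vectors = [[0.95, 0.05], [0.05, 0.95]]
members = {0: [0, 1], 1: [2, 3]}
target = two_stage_search([0.2, 0.8], index_vectors, members, candidates)
print(target)  # 3
```

Only two index vectors and two candidates are compared here, instead of all four candidates. The trade-off is that a candidate lying in a cluster other than the selected one is never examined.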
Suppose the knowledge base contains N triples and no entity is repeated across triples, so that there are 2N entities. In the information-retrieval approach, once the entity in the user question has been recognized, it must be compared with the 2N entities in the base. The present invention instead splits and recombines each triple, producing 3N candidate sentences. If, for the moment, the time the information-retrieval approach spends finding candidate subgraphs and selecting the subgraph containing the answer is taken to be comparable to the time the present invention spends finding the answer from the selected sentence, then the present invention takes 3N/2N = 1.5 times as long as the information-retrieval approach. After an index is built over the candidate vectors, the question-answering time of the present invention falls exponentially and ends up comparable to, or even shorter than, that of information retrieval.
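The comparison counts in this argument can be checked with a short calculation. The sketch below assumes a single-level index with m clusters of roughly equal size, which the text does not state; the function names and the sample values of N and m are likewise illustrative.

```python
import math


def entity_comparisons(n_triples: int) -> int:
    """Information retrieval baseline: compare against 2N entities."""
    return 2 * n_triples


def flat_comparisons(n_triples: int) -> int:
    """Split triples without an index: compare against 3N candidates."""
    return 3 * n_triples


def indexed_comparisons(n_triples: int, m: int) -> int:
    """Single-level index: m index vectors, then one cluster of ~3N/m."""
    return m + math.ceil(3 * n_triples / m)


n = 100_000
print(flat_comparisons(n) / entity_comparisons(n))  # 1.5
print(indexed_comparisons(n, 1000))                 # 1300
```

Under these assumptions the indexed search needs 1,300 comparisons where the flat search needs 300,000; with unbalanced clusters the saving would be smaller.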
In experiments, two parts of answering a user question proved time-consuming. The first is computing the similarity between the question vector and the candidate vectors; this time is positively correlated with the size of the knowledge base, that is, with the number of combined sentences, and with the split sentence combination it is three times that of the ordinary combination. The present invention addresses this first part by clustering and setting up index vectors, which shortens the question-answering time. The second is the vector conversion of the split triple set to obtain the candidate vectors.
Therefore, in some optional implementations, after the split triple set is converted into vectors to obtain the candidate vectors and before the similarity between the question vector and the candidate vectors is computed, the method further includes:
storing the candidate vectors in advance, locally or in a database, where the database is associated with a computing device and the computing device is used to compute the similarity between the question vector and the candidate vectors.
In this implementation, regarding the vector conversion of the split triple set: once the knowledge graph has been built and its content is stable, the combined-sentence vectors derived from the knowledge triples in the graph are fixed. The conversion therefore need not take place during question answering; it can be completed ahead of time, when the graph is built or the model is trained, and the candidate vectors stored in advance locally or in a database associated with the computing device that computes the similarity between the question vector and the candidate vectors. At question-answering time only the user question has to be converted into a question vector, which saves the time of converting the split triple set.
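Precomputing the candidate vectors might look like the sketch below. The JSON file format, the file name, and the function names are assumptions; `toy_encode` is a hypothetical stand-in for the real vector conversion model, which is not reproduced here.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List

Vector = List[float]


def toy_encode(sentence: str) -> Vector:
    """Hypothetical stand-in for the real vector conversion model."""
    return [float(len(sentence)), float(sentence.count(" "))]


def precompute(sentences: List[str], encode: Callable[[str], Vector],
               path: str) -> None:
    """Encode every candidate sentence once and write the vectors to disk."""
    store = {s: encode(s) for s in sentences}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(store, f, ensure_ascii=False)


def load_candidates(path: str) -> Dict[str, Vector]:
    """Read the stored vectors back at question-answering time."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Done once, when the graph is built.
path = os.path.join(tempfile.mkdtemp(), "candidates.json")
precompute(["aspirin treats", "treats headache"], toy_encode, path)

# Done at question time: no candidate sentence is encoded again.
candidates = load_candidates(path)
print(candidates["aspirin treats"])  # [14.0, 1.0]
```

If the graph changes, the stored vectors for the affected triples would need to be regenerated.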
In some optional implementations, before the similarity between the question vector and the candidate vectors is computed in step 101, the method further includes:
training a vector conversion model, the vector conversion model being used to convert the user question into the question vector and to convert the split triple set into the candidate vectors;
where the vector conversion model is the pre-trained language model RoFormer-Sim, with the normalization algorithm in RoFormer-Sim changed to the root mean square (RMS) algorithm.
In this implementation, the graph question-answering service can be provided by means of text semantic similarity. After the graph is built and the triple set obtained, the triples are split and recombined, and the split triple set is converted into candidate vectors by the vector conversion model. The vector conversion model of the present invention is an improved model that combines RoFormer-Sim and RoFormerV2; it preserves effectiveness while optimizing both the training procedure and the model structure, which effectively reduces training time.
Commonly used semantic similarity models include traditional statistical models (for example, term frequency-inverse document frequency (TF-IDF), a weighting technique used in information retrieval and data mining, and simHash) and deep learning models (for example, the Deep Structured Semantic Model (DSSM), and Bidirectional Encoder Representations from Transformers (BERT) and its variants). BERT and its variants are currently the most widely used. However, using the base BERT model for similarity matching in graph question answering does not give good results.
To improve question-answering quality, the similarity comparison in the present invention uses the improved model combining RoFormer-Sim and RoFormerV2, aiming to raise training efficiency as far as possible while maintaining accuracy.
The RoFormer-Sim model is an upgraded version of SimBERT, a well-performing BERT-based model; it is informally known as "SimBERTv2". Seen from outside, apart from the base architecture being replaced by RoFormer, RoFormer-Sim does not differ obviously from SimBERT. Their main difference lies in the training details, which can be compared using two formulas:
Formula 1: SimBERT = BERT + UniLM + contrastive learning
Formula 2: RoFormer-Sim = RoFormer + UniLM + contrastive learning + BART + distillation
In addition, RoFormer-Sim uses more training data and extends to general sentence patterns: unlike SimBERT, which is limited to interrogative sentences, RoFormer-Sim can generate similar sentences for general sentences and so applies to a wider range of scenarios. Compared with the original BERT model and the improved SimBERT model, RoFormer-Sim is better suited to the present task.
The RoFormerV2 model optimizes the structure of the RoFormer model. Structurally, RoFormerV2 mainly removes all bias terms from the model, replaces Layer Norm with the simpler RMS Norm, and removes the gamma parameter of RMS Norm. With these structural optimizations, training speed rises to 1.3 times that of the baseline model. Changing the normalization algorithm in RoFormer-Sim to the RMS algorithm likewise further improves training speed.
RoFormer-Sim retains the classic Transformer normalization algorithm, Layer Normalization (LN). For the l-th hidden layer, let a^l denote the input vector (the weighted output of the previous layer); LN computes the mean and variance statistics of a^l over the layer,
where μ denotes the mean, σ the variance, and H the number of hidden units.
LN then normalizes a^l by subtracting the mean and dividing by the standard deviation,
where a very small number ε is added in the denominator to keep it from being zero.
To ensure that normalization does not destroy the information already learned, a pair of parameters is added, a gain g and a bias b, which are applied to the normalized value to give the final LN output.
Because the LN computation is relatively complex, the simpler RMS normalization can be used instead to speed up training, with the gamma parameter of RMS Norm removed; this is the RMS algorithm used in the present invention.
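The formulas for these steps are not reproduced in this text. For reference, the standard definitions of Layer Normalization and of RMS normalization without a gain parameter, written with the symbols defined above, are given below. This is a reconstruction from the commonly published forms, not a copy of the patent's own equations: the variance is written as σ² for clarity, any activation function applied after LN is omitted, and the exact placement of ε is an assumption.

```latex
% Statistics over the H hidden units of layer l
\mu^{l} = \frac{1}{H}\sum_{i=1}^{H} a_i^{l},
\qquad
(\sigma^{l})^{2} = \frac{1}{H}\sum_{i=1}^{H}\bigl(a_i^{l}-\mu^{l}\bigr)^{2}

% Normalized value
\hat{a}^{l} = \frac{a^{l}-\mu^{l}}{\sqrt{(\sigma^{l})^{2}+\epsilon}}

% LN output with gain g and bias b
h^{l} = g^{l}\odot\hat{a}^{l} + b^{l}

% RMS normalization with the gain (gamma) removed
\bar{a}_i^{l} = \frac{a_i^{l}}{\mathrm{RMS}(a^{l})},
\qquad
\mathrm{RMS}(a^{l}) = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\bigl(a_i^{l}\bigr)^{2}}
```

RMS normalization skips the mean subtraction and carries no learned parameters in this form, which is where the saving in computation comes from.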
The RoFormer-Sim model is modified by changing its normalization algorithm to RMS. For the similarity computation, the modified RoFormer-Sim model converts the user question and the concatenated sentences of the triples into vectors, and the similarity between the question vector and the candidate vectors is computed to determine the target vector among the candidates. In experiments, mapping with this model outperformed mapping with BERT on domain knowledge items that differ by only a few characters; after the model was switched, recognition accuracy for technical terms with small textual differences improved considerably.
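To make the difference between the two normalizations concrete, the sketch below implements both on plain Python lists. It follows the standard textbook definitions rather than the RoFormer source code; the function names and the value of `eps` are assumptions.

```python
import math
from typing import List


def layer_norm(a: List[float], gain: List[float], bias: List[float],
               eps: float = 1e-12) -> List[float]:
    """Standard Layer Normalization with a learned gain and bias."""
    h = len(a)
    mu = sum(a) / h
    var = sum((x - mu) ** 2 for x in a) / h
    return [g * (x - mu) / math.sqrt(var + eps) + b
            for x, g, b in zip(a, gain, bias)]


def rms_norm(a: List[float], eps: float = 1e-12) -> List[float]:
    """RMS normalization with no gain: divide by the root mean square."""
    rms = math.sqrt(sum(x * x for x in a) / len(a) + eps)
    return [x / rms for x in a]


print(layer_norm([1.0, 2.0, 3.0], [1.0] * 3, [0.0] * 3))
print(rms_norm([3.0, 4.0]))
```

`rms_norm` needs one pass to compute the scale and has no parameters to train, whereas `layer_norm` computes two statistics and carries a gain and a bias per hidden unit.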
Referring to Figure 3, which is a structural diagram of the graph question-answering device provided by an embodiment of the present invention, the graph question-answering device 300 includes:
a calculation module 301, configured to compute the similarity between the question vector and the candidate vectors to determine the target vector among the candidate vectors, where the question vector is obtained by vector conversion of the user question, the candidate vectors are the set of vectors obtained by vector conversion of the split triple set, and the similarity between the target vector and the question vector is greater than the similarity between any other candidate vector and the question vector; and
an answering module 302, configured to answer the user question according to the triple corresponding to the target vector.
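One way the answering module could map a target vector back to an answer is sketched below. The text says only that the answer comes from the triple corresponding to the target vector; returning the element that the matched pair left out is an assumption made here, as are the names `build_lookup` and `answer` and the sample triple.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
Record = Tuple[str, Triple, str]  # (candidate sentence, source triple, omitted element)


def build_lookup(triples: List[Triple]) -> List[Record]:
    """Create one record per candidate, in the order the vectors are stored."""
    records: List[Record] = []
    for head, relation, tail in triples:
        triple = (head, relation, tail)
        records.append((f"{head} {relation}", triple, tail))      # <entity-relation>
        records.append((f"{relation} {tail}", triple, head))      # <relation-entity>
        records.append((f"{head} {tail}", triple, relation))      # <entity-entity>
    return records


def answer(target_position: int, records: List[Record]) -> str:
    """Answer with the element that the matched pair left out (assumed rule)."""
    return records[target_position][2]


records = build_lookup([("aspirin", "treats", "headache")])
print(answer(0, records))  # headache
```

Here `target_position` stands for the position of the target vector returned by the similarity search, so the lookup list must be built in the same order as the candidate vectors.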
Optionally, the graph question-answering device 300 further includes:
a construction module, configured to build a knowledge graph from user graph data to obtain a triple set, each triple in the triple set including three elements;
a splitting module, configured to split each triple in the triple set into three pairs to obtain the split triple set, each pair in the split triple set including two elements; and
a conversion module, configured to convert the split triple set into vectors to obtain the candidate vectors.
Optionally, the conversion module includes:
a clustering submodule, configured to convert the split triple set into vectors and then cluster them to obtain index vectors, where each index vector is the cluster center of the candidate vectors it covers and the number of candidate vectors is greater than the number of index vectors;
and the calculation module 301 includes:
a first calculation submodule, configured to compute the similarity between the question vector and the index vectors to determine a target index vector among the index vectors, the similarity between the target index vector and the question vector being greater than the similarity between any other index vector and the question vector; and
a second calculation submodule, configured to compute, among the candidate vectors covered by the target index vector, the similarity between the question vector and the candidate vectors to determine the target vector.
Optionally, the graph question-answering device 300 further includes:
a storage module, configured to store the candidate vectors in advance, locally or in a database, where the database is associated with a computing device and the computing device is used to compute the similarity between the question vector and the candidate vectors.
Optionally, the graph question-answering device 300 further includes:
a training module, configured to train a vector conversion model, the vector conversion model being used to convert the user question into the question vector and to convert the split triple set into the candidate vectors;
where the vector conversion model is the pre-trained language model RoFormer-Sim, with the normalization algorithm in RoFormer-Sim changed to the root mean square (RMS) algorithm.
It should be noted that the graph question-answering device 300 can implement each process of the method embodiment shown in Figure 1; to avoid repetition, the details are not described again here.
An embodiment of the present invention further provides a graph question-answering system, including:
at least one processor; and
a memory communicatively connected to the at least one processor; where
the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform each process of the graph question-answering method embodiment described above and achieve the same technical effect; to avoid repetition, the details are not described again here.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform each process of the graph question-answering method embodiment described above and achieve the same technical effect; to avoid repetition, the details are not described again here. The computer-readable storage medium may be, for example, a ROM, a RAM, a magnetic disk, or an optical disc.
It should be noted that, in this document, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes it. Furthermore, the scope of the methods and devices in the embodiments of the present invention is not limited to performing functions in the order discussed; functions may also be performed substantially simultaneously or in reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from that described, and steps may be added, omitted, or combined. Features described with reference to certain examples may also be combined in other examples.
From the description of the embodiments above, those skilled in the art will clearly understand that the methods of the embodiments can be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is preferable. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied as a software product. The computer software product is stored on a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes instructions that cause a terminal (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to perform the methods described in the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the specific implementations described, which are illustrative rather than restrictive. Guided by the present invention, those of ordinary skill in the art can devise many further forms without departing from the spirit of the invention or the scope protected by the claims, all of which fall within the protection of the present invention.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310217398.3A CN116910189B (en) | 2023-03-08 | 2023-03-08 | Atlas question-answering method, apparatus, system and medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310217398.3A CN116910189B (en) | 2023-03-08 | 2023-03-08 | Atlas question-answering method, apparatus, system and medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116910189A (en) | 2023-10-20 |
| CN116910189B (en) | 2026-02-03 |
Family
ID=88365528
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310217398.3A Active CN116910189B (en) | 2023-03-08 | 2023-03-08 | Atlas question-answering method, apparatus, system and medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116910189B (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102682162A (en) * | 2012-04-24 | 2012-09-19 | 河海大学 | Organizational overlapping core drug group discovery method based on complex network community discovery |
| CN111639171A (en) * | 2020-06-08 | 2020-09-08 | 吉林大学 | Knowledge graph question-answering method and device |
| CN112115687A (en) * | 2020-08-26 | 2020-12-22 | 华南理工大学 | Problem generation method combining triples and entity types in knowledge base |
| CN112632250A (en) * | 2020-12-23 | 2021-04-09 | 南京航空航天大学 | Question and answer method and system under multi-document scene |
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102682162A (en) * | 2012-04-24 | 2012-09-19 | 河海大学 | Organizational overlapping core drug group discovery method based on complex network community discovery |
| CN111639171A (en) * | 2020-06-08 | 2020-09-08 | 吉林大学 | Knowledge graph question-answering method and device |
| CN112115687A (en) * | 2020-08-26 | 2020-12-22 | 华南理工大学 | Problem generation method combining triples and entity types in knowledge base |
| WO2022041294A1 (en) * | 2020-08-26 | 2022-03-03 | 华南理工大学 | Method of generating questions by combining triple and entity type in knowledge base |
| CN112632250A (en) * | 2020-12-23 | 2021-04-09 | 南京航空航天大学 | Question and answer method and system under multi-document scene |
Non-Patent Citations (2)
| Title |
|---|
| 彭莉兰: "面向知识图谱的时间问题回答语料库构建研究", 《中国优秀硕士论文电子期刊网》, 31 July 2021 (2021-07-31) * |
| 苏剑林: "RoFormerV2:自然语言理解的极限探索", Retrieved from the Internet <URL:《知乎 https://zhuanlan.zhihu.com/p/485539469》> * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN116910189B (en) | 2026-02-03 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant |