
CN118779468A - A method for entity relationship extraction based on a large model and dynamic prompts


Info

Publication number
CN118779468A
CN118779468A
Authority
CN
China
Prior art keywords
entity
paragraph
relationship
template
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411064169.3A
Other languages
Chinese (zh)
Inventor
赵霞
刘雪姣
黄江勇
于重重
周东岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Publication of CN118779468A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 - Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 - Computing arrangements using knowledge-based models
    • G06N5/02 - Knowledge representation; Symbolic representation
    • G06N5/022 - Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses an entity relationship extraction method based on a large model and dynamic prompts. The method comprises: defining the schema layer and entity sets of a domain knowledge graph; constructing an entity-relationship triple example set from the schema layer and entity sets; constructing an entity vector database DB; constructing a prompt template set PR; dividing the text to be processed into paragraphs and extracting a keyword list for each paragraph; using the keyword list to construct a paragraph-entity association list T for each paragraph; for each paragraph, constructing a dynamic prompt from the association list T, the prompt template set PR and the triple example set, feeding the dynamic prompt into the large model for entity relationship extraction, and checking the correctness of the result. The invention combines the advantages of knowledge graphs, entity vector representation, dynamic prompts and large models, achieves automatic extraction of entity relationships without fine-tuning the large model, reduces the cost of entity relationship extraction, and improves its efficiency and accuracy, giving it broad application value.

Description

A method for entity relationship extraction based on a large model and dynamic prompts

Technical Field

The present invention relates to entity relationship extraction, and in particular to an entity relationship extraction method based on a large model and dynamic prompts, belonging to the field of artificial intelligence applications.

Background Art

As large language models have shown excellent performance on a wide variety of tasks, their potential applications in different fields are being actively explored. Existing large models are trained on general-domain corpora; they learn language patterns from a broad range of texts and can extract entity relationships from unstructured text to build general-domain knowledge graphs. However, because they are not trained on domain-specific corpora, and because the terminology, concepts and entity relationships of specific domains are complex, existing general-purpose large models do not yet achieve the expected results when used for entity relationship extraction in a specific domain.

To improve the accuracy with which large language models extract entity relationships, researchers have proposed two kinds of methods: one builds a domain-specific corpus and fine-tunes the large model on it; the other uses prompt engineering to give the large model prompts that help it understand the text. In the patent "API entity-relationship joint extraction method and system based on dynamic prompts", Huang Jing et al. construct dynamic prompts for joint extraction of API entities and relationships and define a structured extraction language, from which training and test sets are built to fine-tune a large model for joint extraction of API entity relationships; that patent targets the API field of software development and uses dynamic prompts to improve the accuracy of extracting entity relationships from API description text. In the patent "Entity relationship extraction method, device, medium and equipment based on prompt learning", Yang Li et al. encode sentence features and tune them with prompt learning, decomposing the entity relationship extraction task into two stages, entity recognition and relation classification, and, combined with a question-answering task, use the semantic information of the text to be processed to extract triples from a small amount of annotated text; that patent splits entity relationship extraction into two independent stages and cannot perform joint extraction. In the patent "A few-shot nested relation extraction algorithm based on dynamic prompt learning", Huang Yihua et al. propose a few-shot nested relation extraction framework based on dynamic prompt learning, which improves the accuracy of relation recognition for nested relation extraction but does not involve joint extraction of entity relationships.

Summary of the Invention

The present invention discloses an entity relationship extraction method based on a large model and dynamic prompts. It optimizes the construction of dynamic prompts for domain-specific knowledge and uses a large language model to extract entity relationships from unstructured text (hereinafter simply "text"). The method comprises: 1) defining the schema layer and entity sets of a domain knowledge graph; 2) constructing an entity-relationship triple example set from the schema layer and entity sets; 3) constructing an entity vector database DB; 4) constructing a prompt template set PR; 5) dividing the text to be processed into paragraphs and extracting a keyword list for each paragraph; 6) using the keyword list to construct a paragraph-entity association list T for each paragraph; 7) for each paragraph, constructing a dynamic prompt from the association list T, the prompt template set PR and the triple example set; 8) for each paragraph, feeding the dynamic prompt into the large model for entity relationship extraction and checking the correctness of the result. In detail, the method of the present invention comprises the following steps:

A. Define the schema layer and entity sets of the domain knowledge graph. The specific steps are as follows:

A1. Define the set of entity categories C={C1,…,Ci,…,CN} (i∈[1,N]) and the set of relations between entity categories R={R1,…,Rj,…,RM} (j∈[1,M]), where N and M are the number of entity categories and the number of relations in the knowledge graph to be built;

A2. From a domain terminology dictionary or catalogue, extract the terms belonging to entity category Ci (i∈[1,N]) as entities, forming the entity set of that category Ei={ei1,…,eij,…,eik} (i∈[1,N], j∈[1,k]; k is the number of entities in the set and varies with the entity category Ci);

A3. For an entity category Ch (h∈[1,N]) and an entity category Ct (t∈[1,N]) linked by relation Rj (j∈[1,M]), record the entity-category triple (Ch, Rj, Ct), where Ch is called the head entity category with entity set Ehj, and Ct is called the tail entity category with entity set Etj; all entity-category triples together constitute the schema layer of the knowledge graph;
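By way of illustration only, and not as part of the claimed method, the schema layer of step A can be held in a few plain Python data structures. The sketch below uses hypothetical names and the knowledge-point example from the detailed description:

```python
# Minimal sketch of step A: schema layer and entity sets as plain data structures.
# All names are illustrative and follow the Python-course example used later on.

entity_categories = ["KnowledgePoint"]            # C = {C1, ..., CN}, here N = 1
relations = ["contains", "depends_on"]            # R = {R1, ..., RM}, here M = 2

# Entity sets per category, e.g. extracted from a textbook table of contents (step A2)
entity_sets = {
    "KnowledgePoint": ["basic data type", "control structure", "integer type",
                       "branch structure", "binary", "decimal", "Boolean type"],
}

# Schema layer: all entity-category triples (Ch, Rj, Ct) (step A3)
schema_layer = [
    ("KnowledgePoint", "contains", "KnowledgePoint"),
    ("KnowledgePoint", "depends_on", "KnowledgePoint"),
]
```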

B. Use the schema layer and entity sets to construct the entity-relationship triple example sets. Taking the entity-category triple (Ch, Rj, Ct) as an example, the specific operations are as follows:

B1. Select F entity pairs from entity set Eh and entity set Et such that each pair satisfies relation Rj;

B2. Combine each entity pair with relation Rj into an entity-relationship triple (ehi, Rj, eti), where ehi∈Eh (i∈[1,F]) and eti∈Et (i∈[1,F]);

B3. From the F entity-relationship triples built in step B2, construct the example set of relation Rj: Sj={(ehi, Rj, eti) for i in [1:F]}, where "for i in [1:F]" means i takes integer values from 1 to F;
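A minimal sketch of step B follows; the helper name build_example_set and the sample pairs are illustrative, taken from the worked example later in the description:

```python
# Minimal sketch of step B: build the example set Sj for a relation Rj from
# F hand-picked entity pairs that are known to satisfy the relation.

def build_example_set(relation, entity_pairs):
    """Return Sj = [(head, relation, tail), ...] for the given entity pairs (B1-B3)."""
    return [(head, relation, tail) for head, tail in entity_pairs]

S_contains = build_example_set("contains", [
    ("control structure", "branch structure"),
    ("branch structure", "two-branch structure"),
])
S_depends_on = build_example_set("depends_on", [
    ("control structure", "basic data type"),
    ("control structure", "Boolean type"),
])
```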

C. Construct the entity vector database DB. The specific steps are as follows:

C1. Encode each entity eij∈Ei (i∈[1,N], j∈[1,|Ei|]) of every entity set Ei (i∈[1,N]) into an entity vector veij using a pre-trained language model;

C2. Store each pair of entity eij and entity vector veij in the entity vector database DB;
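Step C can be sketched as follows. The sentence-transformers library and the model name are assumptions standing in for "a pre-trained language model" (the embodiment uses BERT), and a plain dictionary stands in for the vector database:

```python
# Minimal sketch of step C: encode every entity string into a vector and store
# the (entity, vector) pairs. A dict stands in for a real vector store.

from sentence_transformers import SentenceTransformer  # assumption: library installed

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative model name

def build_entity_vector_db(entity_sets):
    """Encode each entity e_ij into a vector ve_ij and store the pair (C1-C2)."""
    db = {}
    for category, entities in entity_sets.items():
        vectors = encoder.encode(entities)        # one vector per entity string
        for entity, vector in zip(entities, vectors):
            db[entity] = vector
    return db

entity_vector_db = build_entity_vector_db(
    {"KnowledgePoint": ["integer type", "decimal", "binary", "octal", "hexadecimal"]}
)
```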

D. Construct the prompt template set PR. The specific steps are as follows:

D1. Define the template set PR={pr1,…,prk,…,prn} (k∈[1,n]), where n is the number of templates in PR;

D2. Construct each prompt template prk∈PR; a template contains several template variables, defined as follows:

D2.1 Define the domain variable ${field}, whose value range is the set of domain identifiers;

D2.2 Define the paragraph variable ${Pi}, whose value range is the set of paragraph identifiers;

D2.3 Define the entity variables ${e1}, …, ${ej}, …, ${em}, where e1~em are the entities associated with the paragraph;

D2.4 Define the relation variables ${R1}, …, ${Rj}, …, ${RM}, where R1~RM are all relations of the knowledge graph to be built;

D2.5 Define the triple-example-set variables ${S1}, …, ${Sj}, …, ${SM}, where S1~SM are the entity-relationship triple example sets corresponding to relations R1~RM;
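As an informal illustration of steps D1-D2.5, a template can be written with Python's string.Template, whose ${...} placeholder syntax matches the notation above. For brevity the sketch collapses the per-entity, per-relation and per-example variables into single list-valued placeholders, and the wording is an abbreviated version of template pr1 from the worked example:

```python
# Minimal sketch of step D: one prompt template with ${...} template variables.

from string import Template

pr1 = Template(
    "You are a senior expert in the ${field} domain. Given the paragraph: ${Pi}\n"
    "and the entities found in it: ${entities}\n"
    "extract all entity-relationship triples in the form (head entity, relation, tail entity).\n"
    "Only the following relations are allowed between entities: ${relations}.\n"
    "Reference examples: ${examples}"
)
```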

E. Divide the text to be processed into paragraphs and extract a keyword list for each paragraph. The specific steps are as follows:

E1. Divide the unstructured text to be processed into paragraphs, as follows:

E1.1 Clean the text data, removing useless punctuation, special characters, etc.;
E1.2 Represent the text as word vectors using a pre-trained word-vector model, producing a word-vector sequence;
E1.3 Feed the word-vector sequence into a neural-network sequence-labelling model to detect topic boundaries in the text and obtain the start and end positions of each paragraph;

E1.4 Split the text into paragraphs at the detected topic boundaries, obtaining the paragraph set P={P1,…,Pi,…,Pt} (i∈[1,t]), where t is the total number of paragraphs in the set;

E2. Extract the keyword list of each paragraph Pi (i∈[1,t]), as follows:

E2.1 Extract the keywords of each paragraph Pi (i∈[1,t]) and add them to the set Ki={ki1,ki2,...,kip} (p is the total number of keywords of Pi); keyword extraction methods include, but are not limited to, TF-IDF, TextRank and methods based on pre-trained language models;

E2.2 Combine paragraph Pi and keyword set Ki into the keyword list of paragraph Pi: PLi=[Pi, Ki];
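One possible realisation of step E2 uses TF-IDF from scikit-learn, which is among the methods named above; the function name and the top_k parameter are illustrative, paragraph splitting (E1) is assumed done, and Chinese text would additionally need a language-appropriate tokenizer:

```python
# Minimal sketch of step E2: TF-IDF keyword extraction per paragraph.

from sklearn.feature_extraction.text import TfidfVectorizer

def extract_keywords(paragraphs, top_k=8):
    """Return PL = [(paragraph, [top-k keywords]), ...] (steps E2.1-E2.2)."""
    vectorizer = TfidfVectorizer()                     # default tokenizer; replace for Chinese
    tfidf = vectorizer.fit_transform(paragraphs)       # rows: paragraphs, cols: terms
    terms = vectorizer.get_feature_names_out()
    keyword_lists = []
    for i, paragraph in enumerate(paragraphs):
        row = tfidf[i].toarray().ravel()
        top_idx = row.argsort()[::-1][:top_k]          # indices of the k highest TF-IDF scores
        keywords = [terms[j] for j in top_idx if row[j] > 0]
        keyword_lists.append((paragraph, keywords))
    return keyword_lists
```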

F. Use the keyword list to construct a paragraph-entity association list T for each paragraph. Taking paragraph Pi as an example, the specific steps are as follows:

F1. Encode each keyword kij∈Ki (j∈[1,p]) into a keyword vector vkij using the pre-trained language model of step C1;

F2. Run a similarity search for the keyword vector vkij in the entity vector database and retrieve the Top-N most similar entities, written Top-N[kij]={e1,e2,…,eN};

F3. De-duplicate all similar entities retrieved for all keywords in Ki to form the entity list Li;

F4. Combine paragraph Pi and entity list Li into the paragraph-entity association list of Pi: Ti=[Pi, Li];
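Steps F1-F4 can be sketched as a cosine-similarity search over the in-memory entity vector database from the step-C sketch; the encoder and database are passed in as parameters and all names are illustrative:

```python
# Minimal sketch of step F: keyword-to-entity similarity search and de-duplication.

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def paragraph_entity_list(paragraph, keywords, encoder, entity_vector_db, top_n=1):
    entities = []
    for kw in keywords:
        kw_vec = encoder.encode([kw])[0]                   # F1: encode the keyword
        scored = sorted(entity_vector_db.items(),
                        key=lambda item: cosine(kw_vec, item[1]),
                        reverse=True)                      # F2: rank entities by similarity
        entities.extend(entity for entity, _ in scored[:top_n])
    unique_entities = list(dict.fromkeys(entities))        # F3: de-duplicate, keep order
    return [paragraph, unique_entities]                    # F4: Ti = [Pi, Li]
```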

G. For each paragraph, construct a dynamic prompt from the association list T, the prompt template set PR and the triple example sets. Taking the association list Ti of paragraph Pi, template prk and the triple example sets as an example, the specific steps are as follows:

G1. Replace the template variables in prompt template prk with the corresponding values, as follows:

G1.1 Substitute paragraph Pi of association list Ti for the paragraph variable ${Pi};

G1.2 Substitute each entity of the entity list Li in association list Ti for the entity variables ${e1}~${em};

G1.3 Substitute the relations R1~Rj in which the entities of list Li participate for the relation variables ${R1}~${Rj};

G1.4 Substitute the triples of the example sets S1~Sj corresponding to relations R1~Rj for the triple-example-set variables ${S1}~${Sj};

G2. For the remaining variables in the template, fill in the corresponding values according to the variable definitions, forming a complete dynamic prompt QkPi (k∈[1,n], i∈[1,t]); for example, the domain name field is substituted for the domain variable ${field};
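Steps G1-G2 then reduce to filling the template placeholders. The sketch below reuses the pr1 template and example sets from the earlier sketches; all names and the sample paragraph are illustrative:

```python
# Minimal sketch of step G: substitute values from Ti into the step-D template
# to obtain the dynamic prompt QkPi.

def build_dynamic_prompt(template, field, paragraph, entity_list, relation_list, examples):
    return template.substitute(
        field=field,                                       # G2: remaining variables
        Pi=paragraph,                                      # G1.1: paragraph text
        entities=", ".join(entity_list),                   # G1.2: entities from Li
        relations=", ".join(relation_list),                # G1.3: allowed relations
        examples="; ".join(str(t) for t in examples),      # G1.4: triple example sets
    )

prompt = build_dynamic_prompt(
    pr1, "Python programming language",
    "Integers can be written in four bases: decimal, binary, octal and hexadecimal...",
    ["integer type", "decimal", "binary", "octal", "hexadecimal"],
    ["contains", "depends_on"],
    S_contains + S_depends_on,
)
```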

H. Feed the dynamic prompt into large models for entity relationship extraction and check the correctness of the results, as follows:

H1. Interact with the large models through API calls, sending the dynamic prompt to several large models for entity relationship extraction;

H2. Compare and analyse the extraction results of the different large models and select the best result.
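A minimal sketch of steps H1-H2 is given below. The OpenAI Python SDK call is one illustrative backend and assumes the openai package (version 1.x) and an API key are configured; other models can be plugged in as further callables, and selecting the best result is left to the comparison of step H2:

```python
# Minimal sketch of step H: fan the same dynamic prompt out to several large
# models and collect their raw answers for comparison.

from openai import OpenAI   # assumption: openai Python SDK >= 1.0 is installed

client = OpenAI()           # assumption: OPENAI_API_KEY is set in the environment

def ask_openai(prompt, model="gpt-4"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def extract_with_models(prompt, backends):
    """backends: {model_name: callable(prompt) -> str}; returns all raw outputs."""
    return {name: call(prompt) for name, call in backends.items()}

results = extract_with_models(prompt, {"gpt-4": ask_openai})
for name, text in results.items():
    print(f"--- {name} ---\n{text}\n")
```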

Compared with the prior art, the present invention has the following advantages: it implements a complete entity relationship extraction pipeline for domain-specific unstructured text, starting from the construction of the schema layer and entity sets of the domain knowledge graph, dynamically assembling the elements needed for the prompt and generating dynamic prompts that guide the large model to extract entity relationships in the specific domain; an entity vector database built with a pre-trained language model improves the accuracy of entity recognition and relation extraction in paragraph text. The invention combines the advantages of knowledge graphs, entity vector representation, dynamic prompts and large models, achieves automatic extraction of entity relationships without fine-tuning the large model, reduces the cost of entity relationship extraction, and improves its efficiency and accuracy, giving it broad application value.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1: Flowchart of an entity relationship extraction method based on a large model and dynamic prompts

DETAILED DESCRIPTION

The present invention is described in detail below with reference to the accompanying drawing and a specific implementation example.

The present invention discloses an entity relationship extraction method based on a large model and dynamic prompts. The method comprises: 1) defining the schema layer and entity sets of a domain knowledge graph; 2) constructing an entity-relationship triple example set from the schema layer and entity sets; 3) constructing an entity vector database DB; 4) constructing a prompt template set PR; 5) dividing the text to be processed into paragraphs and extracting a keyword list for each paragraph; 6) using the keyword list to construct a paragraph-entity association list T for each paragraph; 7) for each paragraph, constructing a dynamic prompt from the association list T, the prompt template set PR and the triple example set; 8) for each paragraph, feeding the dynamic prompt into the large model for entity relationship extraction and checking the correctness of the result. The technical solution is described clearly and completely below, step by step, using part of the knowledge graph of a Python programming language course as an example, with the extraction of some entities of one entity category and some of their relations as the worked instance.

1. Define the schema layer and entity sets of the domain knowledge graph. The specific steps are as follows:

1.1 One requirement of the Python programming language course knowledge graph is to extract the various knowledge points in the textbook together with the inclusion and dependency relations between them;

According to this requirement, define the entity category {knowledge point}, written C1, so the entity category set is C={knowledge point}; the set of relations between knowledge-point entities is R={inclusion relation, dependency relation}, with the inclusion relation written R1 and the dependency relation written R2; the number of entity categories in the knowledge graph to be built is N=1 and the number of relations is M=2;

1.2 From the textbook table of contents, extract the entity set of C1: E1={basic data type, control structure, numeric type, branch structure, integer type, single-branch structure, binary, decimal, octal, hexadecimal, Boolean type};

1.3 For the inclusion relation R1 and the dependency relation R2 that hold between entities of the knowledge-point category C1, record the corresponding entity-category triples and entity sets as follows:

1.3.1 The entity-category triple of the inclusion relation R1 is (knowledge point, inclusion relation, knowledge point); its head entity set is Eh1={basic data type, control structure, numeric type, branch structure, integer type} and its tail entity set is Et1={numeric type, branch structure, integer type, single-branch structure, binary, decimal, octal, hexadecimal};

1.3.2 The entity-category triple of the dependency relation R2 is (knowledge point, dependency relation, knowledge point); its head entity set is Eh2={control structure} and its tail entity set is Et2={basic data type, Boolean type};

2. Use the schema layer and entity sets to construct the entity-relationship triple example sets. Taking the entity-category triples (knowledge point, inclusion relation, knowledge point) and (knowledge point, dependency relation, knowledge point) as examples, the specific operations are as follows:

2.1 Select 2 entity pairs from entity sets Eh1 and Et1 that satisfy the inclusion relation R1, and 2 entity pairs from entity sets Eh2 and Et2 that satisfy the dependency relation R2;

2.2 The triple example set of the inclusion relation R1 is S1={(control structure, inclusion relation, branch structure), (branch structure, inclusion relation, two-branch structure)}; the triple example set of the dependency relation R2 is S2={(control structure, dependency relation, basic data type), (control structure, dependency relation, Boolean type)};

3. Construct the entity vector database DB. The specific steps are as follows:

3.1 Encode each entity e1j∈E1 (j∈[1,11]) of entity set E1 into an entity vector ve1j (j∈[1,11]) using BERT;

3.2 Store each pair of entity e1j and entity vector ve1j in the entity vector database DB;

4. Construct the prompt template set PR. The specific steps are as follows:

4.1 Define the template set PR={pr1, pr2, pr3};

4.2 Construct each prompt template prk∈PR (k∈[1,3]); a template contains several template variables. The specific steps are as follows:

4.2.1 Define the content of pr1: "You are a senior expert in the specific field ${field}. Given the paragraph ${Pi} and the entities in it: ${e1}, …, ${ej}, …, ${em}, please extract the entity-relationship triples in this paragraph; the format of a triple is (head entity, relation, tail entity); only the following relations are allowed between entities: ${R1}, ${R2}; the R1 relation is ${explainR1} and the R2 relation is ${explainR2}". The template contains the variables field (domain name), Pi (paragraph), e1~em (the entities associated with paragraph Pi), R1~R2 (the relations of the knowledge graph to be built) and explainR1~explainR2 (example-based explanations of relations R1~R2, built from their triple example sets);

4.2.2 Define the content of pr2: "From the perspective of a professional in the ${field} field, extract entity-relationship triples from paragraph ${Pi}, under the following requirements: 1. the entities of a triple must be among: ${e1}, …, ${ej}, …, ${em}; 2. the relation of a triple must be one of: ${R1}, ${R2}; 3. refer to the following examples: examples ${S1} of relation ${R1} and examples ${S2} of relation ${R2}". The template contains the variables field (domain name), Pi (paragraph), e1~em (the entities associated with paragraph Pi), R1~R2 (the relations of the knowledge graph to be built) and S1~S2 (the triple example sets corresponding to relations R1~R2);

4.2.3 Define the content of pr3: "Based on the following information, extract entity-relationship triples for the given paragraph text: paragraph: ${Pi}; entities: ${e1}, …, ${ej}, …, ${em}; relations: ${R1}, ${R2}; triple examples: ${S1}, ${S2}". The template contains the variables field (domain name), Pi (paragraph), e1~em (the entities associated with paragraph Pi), R1~R2 (the relations of the knowledge graph to be built) and S1~S2 (the triple example sets corresponding to relations R1~R2);

5. Divide the text to be processed into paragraphs and extract a keyword list for each paragraph. The specific steps are as follows:

5.1 Divide the unstructured text to be processed into paragraphs, as follows:

5.1.1 Clean the text data, removing useless punctuation, special characters, etc.;
5.1.2 Represent the text as word vectors using BERT, producing a word-vector sequence;
5.1.3 Feed the word-vector sequence into a BiLSTM-CRF model to detect topic boundaries in the text and obtain the start and end positions of each paragraph;
5.1.4 Split the text into paragraphs at the detected topic boundaries, obtaining the paragraph set P={P1,…,Pi,…,P10} (i∈[1,10]);

5.2 Extract the keywords of each paragraph Pi (i∈[1,10]). Taking paragraph P4 as an example, the specific steps are as follows:

5.2.1 Use the Tongyi Qianwen large model to extract the keywords of paragraph P4: "The integer type is consistent with the concept of an integer in mathematics; examples of integers are 1010, 99, -217, 0x9a, -0x89. Integers can be written in four bases: decimal, binary, octal and hexadecimal. By default integers are decimal; the other bases require a leading symbol. Binary numbers use the leading symbol 0b, octal numbers 0o and hexadecimal numbers 0x; both uppercase and lowercase letters may be used." The keywords are added to the set K4={integer type, integer, integer concept, base representation, decimal, binary, octal, hexadecimal, leading symbol};

5.2.2 Combine paragraph P4 and keyword set K4 into the keyword list of paragraph P4: PL4=[P4, K4];

6. Use the keyword list to construct a paragraph-entity association list T for each paragraph. Taking paragraph P4 as an example, the specific steps are as follows:

6.1 Encode each keyword k4j∈K4 (j∈[1,9]) into a keyword vector vk4j using BERT;

6.2 Run a similarity search for the keyword vector vk4j in the entity vector database and retrieve the Top-1 most similar entity; for the keyword "integer type", for example, the Top-1 similar entity is also "integer type", written Top-1[k4j]={integer type};

6.3 De-duplicate all similar entities retrieved for all keywords in K4 to form the entity list L4=[integer type, decimal, binary, octal, hexadecimal];

6.4 Combine paragraph P4 and entity list L4 into the paragraph-entity association list of P4: T4=[P4, L4];

7. For each paragraph, construct a dynamic prompt from the association list T, the prompt template set PR and the triple example sets. Taking the association list T4 of paragraph P4, template pr1 and the triple example sets as an example, the specific steps are as follows:

7.1 Replace the template variables in prompt template pr1 with the corresponding values, as follows:

7.1.1 Substitute paragraph P4 of association list T4 for the paragraph variable ${Pi};

7.1.2 Substitute each entity of the entity list L4 in association list T4 for the entity variables ${e1}~${e5};

7.1.3 Substitute the relations R1~R2 in which the entities of list L4 participate for the relation variables ${R1}~${R2};

7.1.4 Substitute the triples of the example sets S1~S2 corresponding to relations R1~R2 for the triple-example-set variables ${S1}~${S2};

7.2 For the remaining variables in the template, fill in the corresponding values as needed, forming a complete dynamic prompt QkP4 (k∈[1,3]). Taking pr1 as an example: "Python programming language" is substituted for the domain variable ${field}; "means that two knowledge points are concepts of the same kind and the scope of the former includes the latter, as in the relation between control structure and branch structure" is substituted for the template variable ${explainR1}; and "means that before learning or applying one knowledge point, one or more other knowledge points must first be mastered or understood, for example the Boolean type must be mastered before learning control structures" is substituted for the template variable ${explainR2};

The dynamic prompt Q1P4 is therefore: You are a senior expert in the specific field Python programming language. Given the paragraph "The integer type is consistent with the concept of an integer in mathematics; examples of integers are 1010, 99, -217, 0x9a, -0x89. Integers can be written in four bases: decimal, binary, octal and hexadecimal. By default integers are decimal; the other bases require a leading symbol. Binary numbers use the leading symbol 0b, octal numbers 0o and hexadecimal numbers 0x; both uppercase and lowercase letters may be used." and the entities in the paragraph: integer, decimal, binary, octal, hexadecimal, please extract the entity-relationship triples in this paragraph; the format of a triple is (head entity, relation, tail entity); only the following relations are allowed between entities: inclusion relation, dependency relation; the inclusion relation means that two knowledge points are concepts of the same kind and the scope of the former includes the latter, as in the relation between control structure and branch structure; the dependency relation means that before learning or applying one knowledge point, one or more other knowledge points must first be mastered or understood, for example the Boolean type must be mastered before learning control structures;

Likewise, the dynamic prompt Q2P4 is: From the perspective of a professional in the Python programming language field, extract entity-relationship triples from the paragraph "The integer type is consistent with the concept of an integer in mathematics; examples of integers are 1010, 99, -217, 0x9a, -0x89. Integers can be written in four bases: decimal, binary, octal and hexadecimal. By default integers are decimal; the other bases require a leading symbol. Binary numbers use the leading symbol 0b, octal numbers 0o and hexadecimal numbers 0x; both uppercase and lowercase letters may be used.", under the following requirements: 1. the entities of a triple must be among: integer, decimal, binary, octal, hexadecimal; 2. the relation of a triple must be one of: inclusion relation, dependency relation; 3. refer to the following examples: inclusion-relation examples (control structure, inclusion relation, branch structure), (branch structure, inclusion relation, two-branch structure); dependency-relation examples (control structure, dependency relation, basic data type), (control structure, dependency relation, Boolean type);

And the dynamic prompt Q3P4 is: Based on the following information, extract entity-relationship triples for the given paragraph text: paragraph: "The integer type is consistent with the concept of an integer in mathematics; examples of integers are 1010, 99, -217, 0x9a, -0x89. Integers can be written in four bases: decimal, binary, octal and hexadecimal. By default integers are decimal; the other bases require a leading symbol. Binary numbers use the leading symbol 0b, octal numbers 0o and hexadecimal numbers 0x; both uppercase and lowercase letters may be used."; entities: integer, decimal, binary, octal, hexadecimal; relations: inclusion relation, dependency relation; triple examples: (control structure, inclusion relation, branch structure), (branch structure, inclusion relation, two-branch structure), (control structure, dependency relation, basic data type), (control structure, dependency relation, Boolean type);

8. Feed the dynamic prompts into large models for entity relationship extraction. The specific steps are as follows:

8.1 Interact with the large models through API calls and send the dynamic prompts generated in step 7 to ChatGPT-4 and Claude-3-opus for entity relationship extraction. Taking the dynamic prompts Q1P4, Q2P4 and Q3P4 as examples, the extraction results of ChatGPT-4 and Claude-3-opus are shown in Table 1;

Table 1. Entity relationship extraction results

Q1P4, ChatGPT-4: Based on your requirements, the following entity-relationship triples are extracted from the paragraph. Inclusion relation: (integer, inclusion relation, decimal), (integer, inclusion relation, binary), (integer, inclusion relation, octal), (integer, inclusion relation, hexadecimal). Dependency relation: (binary, dependency relation, leading symbol 0b), (octal, dependency relation, leading symbol 0o), (hexadecimal, dependency relation, leading symbol 0x). These triples precisely describe the relations between the entities involved in the paragraph and satisfy your requirements.

Q1P4, Claude-3-opus: Based on the given paragraph and entities, I extracted the following entity-relationship triples: (integer type, contains, decimal), (integer type, contains, binary), (integer type, contains, octal), (integer type, contains, hexadecimal), (binary, depends on, 0b prefix), (octal, depends on, 0o prefix), (hexadecimal, depends on, 0x prefix). These triples state that the integer type includes the four base representations decimal, binary, octal and hexadecimal, and that binary, octal and hexadecimal numbers depend on the leading symbols 0b, 0o and 0x respectively.

Q2P4, ChatGPT-4: Based on the provided paragraph and the given requirements, the following entity-relationship triples can be extracted from the text: (integer, inclusion relation, decimal), (integer, inclusion relation, binary), (integer, inclusion relation, octal), (integer, inclusion relation, hexadecimal); these four triples state that integers include the different base representations, each base being one way of writing an integer. (binary, dependency relation, 0b leading symbol), (octal, dependency relation, 0o leading symbol), (hexadecimal, dependency relation, 0x leading symbol); these three triples state that the bases depend on specific leading symbols, 0b, 0o and 0x being the symbols used in Python for binary, octal and hexadecimal. Note that the text does not mention a specific leading symbol for decimal, because by default a number without any prefix is interpreted as decimal, so no triple depending on a leading symbol was created for it.

Q2P4, Claude-3-opus: Based on the given paragraph and requirements, I extracted the following entity-relationship triples: 1. (integer, inclusion relation, decimal) 2. (integer, inclusion relation, binary) 3. (integer, inclusion relation, octal) 4. (integer, inclusion relation, hexadecimal) 5. (binary, dependency relation, 0b prefix) 6. (octal, dependency relation, 0o prefix) 7. (hexadecimal, dependency relation, 0x prefix). These triples satisfy the given entity and relation constraints: the integer type includes the four representations decimal, binary, octal and hexadecimal, and binary, octal and hexadecimal depend on the leading symbols 0b, 0o and 0x respectively.

Q3P4, ChatGPT-4: Based on the information you provided, entity-relationship triples can be extracted for the specified paragraph. The key entities are "integer", "decimal", "binary", "octal" and "hexadecimal", and the relation types include "inclusion relation", "identity relation" and "dependency relation". Examples of these triples: 1. inclusion relation: (integer, inclusion relation, decimal), (integer, inclusion relation, binary), (integer, inclusion relation, octal), (integer, inclusion relation, hexadecimal); 2. dependency relation: (binary, dependency relation, 0b leading symbol), (octal, dependency relation, 0o leading symbol), (hexadecimal, dependency relation, 0x leading symbol), indicating that non-decimal representations depend on specific leading symbols such as 0b, 0o and 0x. These triples help describe the structure and characteristics of the integer type in its different base representations; each triple accurately reflects information mentioned in the paragraph, such as the different ways of writing integers and their dependencies.

Q3P4, Claude-3-opus: OK, given the paragraph and entities, I extract the following triples: (integer, inclusion relation, decimal), (integer, inclusion relation, binary), (integer, inclusion relation, octal), (integer, inclusion relation, hexadecimal), (binary, dependency relation, 0b), (octal, dependency relation, 0o), (hexadecimal, dependency relation, 0x). These triples cover the inclusion relations between the integer type mentioned in the paragraph and the different base representations: the integer type includes four base representations, each with its own notation, and the binary, octal and hexadecimal notations all depend on specific leading symbols.

8.2 Compare and analyse the entity relationship extraction results of the two large models and select the best result, as follows:

For the inclusion relation, the extraction results of the two large models under the three dynamic prompts are essentially identical; for the dependency relation, the tail entities extracted by the two models under the three dynamic prompts differ slightly in wording, so the dependency-relation triples need to be checked and the best extraction result selected.
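One simple automatic check that can support this comparison (an assumption offered for illustration, not part of the patented method) is to parse each model's output and keep only the triples whose entities and relation fall within the allowed lists, flagging the rest, such as differently worded tail entities, for manual review:

```python
# Minimal sketch of a correctness check for extracted triples.

import re

def filter_triples(model_output, allowed_entities, allowed_relations):
    """Split model output into (head, relation, tail) triples and sort them into
    those that satisfy the constraints and those needing manual review."""
    triples = re.findall(r"\(([^,()]+),\s*([^,()]+),\s*([^,()]+)\)", model_output)
    valid, to_review = [], []
    for head, rel, tail in (tuple(x.strip() for x in t) for t in triples):
        if head in allowed_entities and tail in allowed_entities and rel in allowed_relations:
            valid.append((head, rel, tail))
        else:
            to_review.append((head, rel, tail))
    return valid, to_review
```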

The present invention implements a complete entity relationship extraction pipeline for domain-specific unstructured text: starting from the construction of the schema layer and entity sets of the domain knowledge graph, it dynamically assembles the elements needed for the prompt and generates dynamic prompts that guide the large model to extract entity relationships in the specific domain; an entity vector database built with a pre-trained language model improves the accuracy of entity recognition and relation extraction in paragraph text. The invention combines the advantages of knowledge graphs, entity vector representation, dynamic prompts and large models, achieves automatic extraction of entity relationships without fine-tuning the large model, reduces the cost of entity relationship extraction, and improves its efficiency and accuracy, giving it broad application value.

Claims (5)

1.一种基于大模型和动态提示的实体关系抽取方法,其步骤包括:1. A method for extracting entity relationships based on a large model and dynamic prompts, the steps of which include: A.定义领域知识图谱的模式层及实体集,具体步骤如下:A. Define the model layer and entity set of the domain knowledge graph. The specific steps are as follows: A1. 定义实体类别的集合C={C1,…Ci,…CN}(i∈[1,N]),实体类别间关系的集合R={R1,…Rj,…RM}(j∈[1,M]),N和M表示待建知识图谱中的实体类别数和关系数;A1. Define the set of entity categories C={C 1 ,…C i ,…C N }(i∈[1,N]), the set of relationships between entity categories R={R 1 ,…R j ,…R M }(j∈[1,M]), where N and M represent the number of entity categories and the number of relationships in the knowledge graph to be built; A2. 从行业领域术语词典或目录中,抽取属于实体类别Ci(i∈[1,N])的词汇作为实体,构成该类别的实体集合Ei={ei1,…eij,…,eik}(i∈[1,N],j∈[1,k],k表示实体集合中实体的个数,因实体类别Ci而异);A2. Extract words belonging to entity category C i (i∈[1,N]) from the industry terminology dictionary or catalog as entities to form the entity set E i ={e i1 ,…e ij ,…,e ik } of this category (i∈[1,N], j∈[1,k], k represents the number of entities in the entity set, which varies depending on the entity category C i ); A3. 将具有关系Rj(j∈[1,M])的实体类别Ch(h∈[1,N])和实体类别Ct(t∈[1,N])记为实体类别三元组(Ch,Rj,Ct),其中Ch称为头实体类别,其实体集合记为Ehj,Ct称为尾实体类别,其实体集合记为Etj;所有的实体类别三元组构成知识图谱的模式层;A3. The entity category Ch (h∈[1,N]) and entity category C t (t∈[1,N]) with the relation R j (j∈[1,M]) are recorded as entity category triples ( Ch , R j , C t ), where Ch is called the head entity category and its entity set is recorded as E hj , C t is called the tail entity category and its entity set is recorded as E tj ; all entity category triplets constitute the pattern layer of the knowledge graph; B.利用模式层及实体集构建实体关系三元组示例集,以实体类别三元组(Ch,Rj,Ct)为例,具体操作如下:B. Use the pattern layer and entity set to construct an example set of entity relationship triples. Take the entity category triple (C h , R j , C t ) as an example. The specific operations are as follows: B1. 以满足Rj关系为条件,从实体集合Eh和实体集合Et中选取F个实体对;B1. Select F entity pairs from entity set E h and entity set E t under the condition that they satisfy the R j relationship; B2. 将每个实体对和关系Rj,组成一个实体关系三元组(ehi,Rj,eti),其中实体ehi∈Eh(i∈[1,F]),eti∈Et(i∈[1,F]);B2. Combine each entity pair and relation R j into an entity-relation triple (e hi ,R j ,e ti ), where entity e hi ∈E h (i∈[1,F]), e ti ∈E t (i∈[1,F]); B3. 用步骤B2构建的F个实体关系三元组,构建关系Rj的实体关系三元组示例集Sj={(ehi,Rj,eti) for i in [1:F]},其中for i in [1:F]表示i的取值为1到F的整数;B3. Using the F entity-relationship triples constructed in step B2, construct an example set of entity-relationship triples S j ={(e hi ,R j ,e ti ) for i in [1:F]} of relation R j , where for i in [1:F] indicates that the value of i is an integer from 1 to F; C.构建实体向量数据库DB,具体步骤如下:C. Construct entity vector database DB. The specific steps are as follows: C1. 对实体集Ei(i∈[1,N])里的每个实体eij∈Ei(i∈[1,N],j∈[1,|Ei|]),使用预训练语言模型编码为实体向量veijC1. For each entity e ij ∈E i (i∈[1,N],j∈[1,|Ei|]) in the entity set E i (i∈[1,N]), encode it into an entity vector ve ij using the pre-trained language model; C2. 将每一对实体eij和实体向量veij存入实体向量数据库DB;C2. storing each pair of entity e ij and entity vector ve ij into the entity vector database DB; D.构建prompt模板集PR,具体步骤如下:D. Build the prompt template set PR. The specific steps are as follows: D1. 定义模板集PR={pr1,…,prk,…,prn}(k∈[1,n]),n表示模板集PR中含有的模板数量;D1. Define the template set PR = {pr 1 ,…,pr k ,…,pr n } (k∈[1,n]), where n represents the number of templates contained in the template set PR; D2. 构建每个提示模板prk∈PR,模板中包含若干模板变量;D2. Construct each prompt template pr k ∈ PR, where the template contains several template variables; E. 
将待处理文本划分为段落,并提取段落的关键词列表,具体步骤如下:E. Divide the text to be processed into paragraphs and extract the keyword list of the paragraphs. The specific steps are as follows: E1. 将待处理的非结构化文本划分成段落;E1. Divide the unstructured text to be processed into paragraphs; E2. 提取每个段落Pi(i∈[1,t])中的关键词列表;E2. Extract the keyword list in each paragraph Pi (i∈[1,t]); F.利用关键词列表为每个段落构造段落-实体关联列表T,以段落Pi为例,具体步骤如下:F. Use the keyword list to construct a paragraph-entity association list T for each paragraph. Taking paragraph Pi as an example, the specific steps are as follows: F1. 将Ki中的每个关键词kij∈Ki(j∈[1,p]),使用步骤C1的预训练语言模型将其编码为关键词向量vkijF1. Encode each keyword k ij ∈K i (j∈[1,p]) in K i into a keyword vector vk ij using the pre-trained language model in step C1; F2. 将关键词向量vkij在实体向量库中进行相似度搜索,获取Top-N个相似的实体,记为Top-N[kij]={e1,e2,…,eN};F2. Perform similarity search on the keyword vector vk ij in the entity vector library to obtain the Top-N similar entities, denoted as Top-N[k ij ]={e 1 ,e 2 ,…,e N }; F3. 将Ki里的所有关键词对应的所有相似实体,去重,构成实体列表LiF3. Remove duplicates from all similar entities corresponding to all keywords in K i to form an entity list L i ; F4. 将段落Pi、实体列表Li组合为Pi的段落-实体关联列表Ti=[Pi, Li];F4. Combine paragraph Pi and entity list Li into Pi 's paragraph-entity association list Ti = [ Pi , Li ]; G.对每个段落,利用关联列表T、prompt模板集PR和实体关系三元组示例集构造动态提示;G. For each paragraph, construct a dynamic prompt using the association list T, the prompt template set PR, and the entity relationship triple example set; G1. 将提示模板prk中的模板变量,替换为对应的变量值;G1. Replace the template variables in the prompt template pr k with the corresponding variable values; G2. 对于模板中的剩余变量,按照模板变量的定义输入对应的值,形成一个完整的动态提示内容QkPi(k∈[1,n],i∈[1,t]),例如,输入field(领域名称)代入领域变量${field};G2. For the remaining variables in the template, enter the corresponding values according to the definition of the template variables to form a complete dynamic prompt content Q k P i (k∈[1,n],i∈[1,t]). For example, enter field (field name) and substitute it into the field variable ${field}; H.将动态提示内容送入大模型进行实体关系抽取,并检验结果的正确性,具体步骤如下:H. Send the dynamic prompt content to the big model for entity relationship extraction and verify the correctness of the results. The specific steps are as follows: H1. 通过调用API的方式与大模型对话,将动态提示内容送入多个大模型中进行实体关系抽取;H1. Communicate with the big model by calling the API, and send the dynamic prompt content to multiple big models for entity relationship extraction; H2. 对比分析多个大模型的实体关系抽取结果,选取最优抽取结果。H2. Compare and analyze the entity relationship extraction results of multiple large models and select the optimal extraction result. 2.如权利要求1所述的一种基于大模型和动态提示的实体关系抽取方法,构建每个提示模板prk∈PR,模板中包含若干模板变量,具体步骤如下:2. 
2. The method for extracting entity relationships based on a large model and dynamic prompts according to claim 1, wherein building every prompt template prk ∈ PR, each of which contains several template variables, comprises:
D2.1 Define the domain variable ${field}, where the value range of field is the set of domain identifiers;
D2.2 Define the paragraph variable ${Pi}, where the value range of Pi is the set of paragraph identifiers;
D2.3 Define the entity variables ${e1}, …, ${ej}, …, ${em}, where e1 to em are the corresponding entities in the paragraph;
D2.4 Define the relation variables ${R1}, …, ${Rj}, …, ${RM}, where R1 to RM are all the relations in the knowledge graph to be built;
D2.5 Define the entity-relationship triple example set variables ${S1}, …, ${Sj}, …, ${SM}, where S1 to SM are the example sets corresponding to relations R1 to RM.

3. The method for extracting entity relationships based on a large model and dynamic prompts according to claim 1, wherein dividing the unstructured text to be processed into paragraphs comprises:
E1.1 Clean the text data, removing useless punctuation marks, special characters and the like;
E1.2 Represent the text as word vectors with a pre-trained word vector model, generating a word vector sequence;
E1.3 Feed the word vector sequence into a neural-network sequence labeling model for topic boundary detection, obtaining the start and end positions of every paragraph;
E1.4 Divide the text into paragraphs according to the detected topic boundaries, obtaining the paragraph set P = {P1, …, Pi, …, Pt} (i ∈ [1, t]), where t is the total number of paragraphs in the set.

4. The method for extracting entity relationships based on a large model and dynamic prompts according to claim 1, wherein extracting the keyword list of every paragraph Pi (i ∈ [1, t]) comprises:
E2.1 Extract the keywords of every paragraph Pi (i ∈ [1, t]) and add them to the set Ki = {ki1, ki2, …, kip}, where p is the total number of keywords of Pi; the keyword extraction methods include, but are not limited to, TF-IDF, TextRank and methods based on pre-trained language models;
E2.2 Combine paragraph Pi and keyword set Ki into the keyword list of paragraph Pi: PLi = [Pi, Ki].
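Claim 4 leaves the keyword extractor open (TF-IDF, TextRank, or a pre-trained language model). As one concrete instance, the sketch below picks each paragraph's top-p TF-IDF terms with scikit-learn; the value of p is arbitrary, and Chinese text would additionally need a tokenizer passed to the vectorizer.

```python
# One possible realization of E2.1/E2.2 using TF-IDF (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer

def keyword_lists(paragraphs, p=5):
    """Return PLi = [Pi, Ki] for every paragraph, Ki being its top-p TF-IDF terms."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(paragraphs)       # rows: paragraphs, columns: terms
    terms = vectorizer.get_feature_names_out()
    lists = []
    for i, paragraph in enumerate(paragraphs):
        weights = tfidf[i].toarray().ravel()
        top = weights.argsort()[::-1][:p]              # indices of the p largest weights
        Ki = [terms[j] for j in top if weights[j] > 0]
        lists.append([paragraph, Ki])                  # E2.2: PLi = [Pi, Ki]
    return lists
```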
5. The method for extracting entity relationships based on a large model and dynamic prompts according to claim 1, wherein replacing the template variables in prompt template prk with the corresponding variable values comprises:
G1.1 Substitute the paragraph Pi of the association list Ti into the paragraph variable ${Pi};
G1.2 Substitute each entity of the entity list Li of the association list Ti into the entity variables ${e1} to ${em};
G1.3 Substitute the relations R1 to Rj held by the entities of entity list Li into the relation variables ${R1} to ${Rj};
G1.4 Substitute the entity-relationship triple example sets S1 to Sj corresponding to relations R1 to Rj into the example set variables ${S1} to ${Sj}.
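Because the template variables of claims 2 and 5 use the ${...} form, Python's string.Template maps onto them directly. The sketch below fills a toy template in the order G1.1–G1.4; the template wording, the sample paragraph and the sample triple are assumptions for illustration, not text from the patent.

```python
# Sketch of claim 5 (steps G1.1-G1.4): fill ${...} variables with values taken
# from the association list Ti, the relations and the triple example sets.
from string import Template

template = Template(
    "Domain: ${field}\n"
    "Paragraph: ${Pi}\n"
    "Candidate entities: ${e1}, ${e2}\n"
    "Relations: ${R1}\n"
    "Examples: ${S1}\n"
    "Extract all entity-relationship triples contained in the paragraph."
)

Ti = [
    "The boiler showed tube leakage during the overhaul.",  # Pi (toy paragraph)
    ["boiler", "tube leakage"],                             # Li (toy entity list)
]

prompt = template.safe_substitute(
    field="power equipment",                      # domain variable (claim 2, D2.1)
    Pi=Ti[0],                                     # G1.1: paragraph
    e1=Ti[1][0], e2=Ti[1][1],                     # G1.2: entities of Li
    R1="has_fault",                               # G1.3: relation held by the entities
    S1="(boiler, has_fault, tube leakage)",       # G1.4: triple example set
)
print(prompt)
```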
CN202411064169.3A 2024-04-30 2024-08-05 A method for entity relationship extraction based on large model and dynamic prompts Pending CN118779468A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202410535188 2024-04-30
CN2024105351883 2024-04-30

Publications (1)

Publication Number Publication Date
CN118779468A true CN118779468A (en) 2024-10-15

Family

ID=92991032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411064169.3A Pending CN118779468A (en) 2024-04-30 2024-08-05 A method for entity relationship extraction based on large model and dynamic prompts

Country Status (1)

Country Link
CN (1) CN118779468A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119168052A (en) * 2024-11-22 2024-12-20 南昌大学 A forensics method and system based on virtual knowledge graph and big data

Similar Documents

Publication Publication Date Title
CN113987209B (en) Natural language processing method, device, computing device and storage medium based on knowledge-guided prefix fine-tuning
CN106407113B (en) A kind of bug localization method based on the library Stack Overflow and commit
WO2020010834A1 (en) Faq question and answer library generalization method, apparatus, and device
CN116127090B (en) Aviation system knowledge graph construction method based on fusion and semi-supervision information extraction
US20150066476A1 (en) Methods and Systems of Four Valued Analogical Transformation Operators Used in Natural Language Processing and Other Applications
Gong et al. Continual pre-training of language models for math problem understanding with syntax-aware memory network
Kashmira et al. Generating entity relationship diagram from requirement specification based on nlp
CN112487190A (en) Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN118779468A (en) A method for entity relationship extraction based on large model and dynamic prompts
Cui et al. Simple question answering over knowledge graph enhanced by question pattern classification
Bao et al. Text generation from tables
CN116341569A (en) Professional document intelligent auxiliary reading method based on domain knowledge base
Kalo et al. Knowlybert-hybrid query answering over language models and knowledge graphs
Sunkle et al. Comparison and synergy between fact-orientation and relation extraction for domain model generation in regulatory compliance
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN115658846A (en) Intelligent search method and device suitable for open-source software supply chain
CN113901224B (en) Confidential text recognition model training method, system and device based on knowledge distillation
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN114372148A (en) A data processing method and terminal device based on knowledge graph technology
WO2019043380A1 (en) Semantic parsing
CN116483990B (en) Internet news content automatic generation method based on big data
Bader et al. Facilitating User-Centric Model-Based Systems Engineering Using Generative AI.
CN108549633B (en) Text fine-grained emotion generation method based on probabilistic reasoning and emotion cognition
CN114757181B (en) Method and device for training and extracting event of end-to-end event extraction model based on prior knowledge
CN115309858A (en) Arithmetic expression generation method, device, equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination