
CN114398482A - A dictionary construction method, device, electronic device and storage medium - Google Patents

A dictionary construction method, device, electronic device and storage medium

Info

Publication number
CN114398482A
CN114398482A (application CN202111475744.5A; granted as CN114398482B)
Authority
CN
China
Prior art keywords
word
category
named entity
target
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111475744.5A
Other languages
Chinese (zh)
Other versions
CN114398482B (en)
Inventor
胡飞雄
朱磊
朱晓燕
张雨
付晨阳
何萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Cyber Tianjin Co Ltd
Original Assignee
Tencent Cyber Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Cyber Tianjin Co Ltd
Priority to CN202111475744.5A
Publication of CN114398482A
Application granted
Publication of CN114398482B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

本申请提供一种词典构造方法、装置、电子设备及存储介质,涉及自然语言处理技术领域。该方法在获取到待处理文本和包含多个词语类别的基础词典后,基于已训练的类别预测模型,从待处理文本中选取出语义信息属于命名实体的至少一个目标词语,并确定出至少一个目标词语各自分别属于多个词语类别的概率值,以将至少一个目标词语,分别归属至对应的概率值符合设定概率条件的词语类别中。由于可以根据语义信息准确地确定出文本中属于命名实体的目标词语,以及目标词语分别属于基础词典中的各个命名实体类别的概率值,从而可以将目标词语准确地归属至对应的命名实体类别中,提高词典构造的准确率。

Figure 202111475744

The present application provides a dictionary construction method, device, electronic device and storage medium, relating to the technical field of natural language processing. After obtaining the text to be processed and a basic dictionary containing multiple word categories, the method uses a trained category prediction model to select, from the text to be processed, at least one target word whose semantic information indicates a named entity, and determines the probability values of each target word belonging to each of the multiple word categories, so that each target word is assigned to the word category whose probability value meets a set probability condition. Because the target words in the text that are named entities, and the probability values of those words belonging to each named entity category in the basic dictionary, can be determined accurately from the semantic information, the target words can be assigned accurately to the corresponding named entity categories, improving the accuracy of dictionary construction.


Description

一种词典构造方法、装置、电子设备及存储介质A dictionary construction method, device, electronic device and storage medium

技术领域technical field

本申请实施例涉及自然语言处理技术领域,尤其涉及一种词典构造方法、装置、电子设备及存储介质。The embodiments of the present application relate to the technical field of natural language processing, and in particular, to a dictionary construction method, an apparatus, an electronic device, and a storage medium.

背景技术Background technique

在对文本进行信息识别时,通常会先进行词典构造,将各种类别的词语加入至词典中,再采用构造好的词典对文本进行识别,就能快速并准确地识别出文本中所包含的词语所属的类别。When recognizing information in text, a dictionary is usually constructed first: words of various categories are added to the dictionary, and the constructed dictionary is then used to recognize the text, so that the categories of the words contained in the text can be identified quickly and accurately.

在相关技术中,常采用基于N-Gram的方法进行词典构造。基于N-Gram 的词典构造方法,是先采用N-Gram算法对文本进行切分,得到N个词语,然后使用多个现有的词典进行对比,从N个词语中选择出新词加入至词典中。In the related art, N-Gram-based methods are often used for dictionary construction. An N-Gram-based dictionary construction method first uses the N-Gram algorithm to segment the text into N words, then compares them against multiple existing dictionaries and selects new words from the N words to add to the dictionary.

由于该方法所提取的新词与采用N-Gram算法得到的N个词语相关,当采用N-Gram算法切分得到的词语不准确时,从N个词语中提取出的新词就不够准确,因此,构造的词典的准确率也就不高。Since the new words extracted by this method depend on the N words obtained by the N-Gram algorithm, when the segmentation produced by the N-Gram algorithm is inaccurate, the new words extracted from the N words are also inaccurate; as a result, the accuracy of the constructed dictionary is low.
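The N-Gram baseline described above can be summarized in a few lines. This is only an illustrative sketch (not code from the patent or any cited work): candidate character n-grams are enumerated, and any n-gram already present in the existing dictionaries is discarded, leaving the "new words".

```python
def ngram_candidates(text, n_max=3):
    """Enumerate every character n-gram of length 1..n_max (crude segmentation)."""
    out = []
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            out.append(text[i:i + n])
    return out

def new_words(text, existing_dicts, n_max=3):
    """Keep only n-grams that appear in none of the existing dictionaries."""
    known = set().union(*existing_dicts) if existing_dicts else set()
    return [w for w in ngram_candidates(text, n_max) if w not in known]
```

The weakness noted in the text is visible here: the candidate set is purely positional, so a bad segmentation (e.g. an n-gram straddling a word boundary) propagates directly into the "new words".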

发明内容SUMMARY OF THE INVENTION

为解决相关技术中存在的技术问题,本申请实施例提供一种词典构造方法、装置、电子设备及存储介质,可以提高词典构造的准确率。In order to solve the technical problems existing in the related art, the embodiments of the present application provide a dictionary construction method, an apparatus, an electronic device and a storage medium, which can improve the accuracy of dictionary construction.

本申请实施例提供的具体技术方案如下:The specific technical solutions provided by the embodiments of the present application are as follows:

一种词典构造方法,包括:A dictionary construction method that includes:

获取待处理文本和基础词典;其中,所述基础词典包含多个词语类别;Obtain the text to be processed and a basic dictionary; wherein, the basic dictionary contains multiple word categories;

基于已训练的类别预测模型,确定出所述待处理文本包含的至少一个候选词语,以及所述至少一个候选词语各自的语义信息;Based on the trained category prediction model, determine at least one candidate word included in the text to be processed, and the respective semantic information of the at least one candidate word;

通过所述类别预测模型,根据所述至少一个候选词语各自的语义信息,选取出符合设定语义条件的至少一个目标词语,并确定所述至少一个目标词语各自分别属于所述多个词语类别的概率值;Through the category prediction model, according to the respective semantic information of the at least one candidate word, selecting at least one target word that meets a set semantic condition, and determining the probability values of each of the at least one target word belonging to each of the multiple word categories;

将所述至少一个目标词语,分别归属至对应的概率值符合设定概率条件的词语类别中。The at least one target word is respectively assigned to word categories whose corresponding probability values meet the set probability conditions.
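The four claimed steps can be condensed into the following sketch. Everything here is illustrative, not the patent's implementation: `model` stands in for the trained category prediction model and is assumed to return, for each selected target word, its probability over the base dictionary's categories; the 0.5 threshold is just one possible "set probability condition".

```python
def build_dictionary(model, text, base_dict, prob_threshold=0.5):
    """Assign target words found in `text` to the categories of `base_dict`.

    `model(text)` is a hypothetical callable returning (word, {category: prob})
    pairs for words whose semantic information marks them as named entities.
    """
    for word, probs in model(text):
        category = max(probs, key=probs.get)          # most likely category
        if probs[category] >= prob_threshold:          # set probability condition
            base_dict.setdefault(category, set()).add(word)
    return base_dict
```

A word whose best category fails the condition is simply left out of the dictionary in this sketch; the optional claims below describe a finer fallback based on word similarity.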

一种词典构造装置,包括:A dictionary construction device, comprising:

获取模块,用于获取待处理文本和基础词典;其中,所述基础词典包含多个词语类别;an acquisition module for acquiring the text to be processed and a basic dictionary; wherein the basic dictionary contains multiple word categories;

词语识别模块,用于基于已训练的类别预测模型,确定出所述待处理文本包含的至少一个候选词语,以及所述至少一个候选词语各自的语义信息;A word recognition module, configured to determine at least one candidate word contained in the text to be processed based on the trained category prediction model, and the respective semantic information of the at least one candidate word;

类别识别模块,用于通过所述类别预测模型,根据所述至少一个候选词语各自的语义信息,选取出符合设定语义条件的至少一个目标词语,并确定所述至少一个目标词语各自分别属于所述多个词语类别的概率值;a category recognition module, configured to select, through the category prediction model and according to the respective semantic information of the at least one candidate word, at least one target word that meets a set semantic condition, and to determine the probability values of each of the at least one target word belonging to each of the multiple word categories;

词典构造模块,用于将所述至少一个目标词语,分别归属至对应的概率值符合设定概率条件的词语类别中。The dictionary construction module is used for attributing the at least one target word to word categories whose corresponding probability values meet the set probability conditions, respectively.

可选的,所述类别预测模型包括预训练语言子模型和命名实体识别子模型;所述词语识别模块,具体用于:Optionally, the category prediction model includes a pre-trained language sub-model and a named entity recognition sub-model; the word recognition module is specifically used for:

基于所述待处理文本,通过所述预训练语言子模型,得到所述待处理文本包含的至少一个单词各自对应的词向量;其中,每个词向量表征相应单词的语义信息;Based on the text to be processed, through the pre-trained language sub-model, a word vector corresponding to at least one word contained in the text to be processed is obtained; wherein, each word vector represents the semantic information of the corresponding word;

基于所述至少一个单词各自对应的词向量,通过所述命名实体识别子模型,对所述至少一个单词进行组合,得到至少一个候选词语,以及所述至少一个候选词语各自的语义信息;其中,每个候选词语包含至少一个单词。Based on the word vector corresponding to each of the at least one word, combining the at least one word through the named entity recognition sub-model to obtain at least one candidate word and the respective semantic information of the at least one candidate word, wherein each candidate word contains at least one word.
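One common way for an NER sub-model to "combine" per-unit predictions into candidate words is BIO-style tag decoding over the per-character vectors. The patent does not name a tagging scheme, so the following is only an assumed sketch of that combination step:

```python
def decode_candidates(chars, tags):
    """Merge per-character B/I/O tags into candidate words.

    BIO tagging is an illustrative assumption; the patent only states that the
    named entity recognition sub-model combines units into candidate words.
    """
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "B":                 # begin a new candidate word
            if current:
                words.append(current)
            current = ch
        elif tag == "I" and current:   # continue the open candidate word
            current += ch
        else:                          # "O" (or stray "I") closes any open span
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words
```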

可选的,所述类别识别模块,具体用于:Optionally, the category identification module is specifically used for:

通过所述命名实体识别子模型,从所述至少一个候选词语中,选取出语义信息属于命名实体的至少一个目标词语;所述命名实体为具有特定语义的实体名称;Through the named entity recognition sub-model, from the at least one candidate word, at least one target word whose semantic information belongs to a named entity is selected; the named entity is an entity name with specific semantics;

针对所述至少一个目标词语,分别执行以下操作:通过所述命名实体识别子模型,根据所述一个目标词语的语义信息,确定所述一个目标词语分别属于所述多个词语类别的概率值。For each of the at least one target word, the following operation is performed: through the named entity recognition sub-model and according to the semantic information of the target word, determining the probability values of the target word belonging to each of the multiple word categories.

可选的,还包括模型训练模块,所述模型训练模块用于:Optionally, it also includes a model training module, and the model training module is used for:

获取训练数据集;所述训练数据集中包括多个文本数据样本,所述文本数据样本中标注有设定类别;Obtaining a training data set; the training data set includes a plurality of text data samples, and the text data samples are marked with set categories;

基于所述训练数据集,对所述类别预测模型进行迭代训练,直到满足设定的收敛条件为止,其中,一次迭代训练过程包括:Based on the training data set, the category prediction model is iteratively trained until a set convergence condition is met, wherein an iterative training process includes:

基于从所述训练数据集中抽取的文本数据样本,通过所述类别预测模型,确定所述文本数据样本中的至少一个目标词语,并确定所述至少一个目标词语各自对应的目标词语类别;Based on the text data samples extracted from the training data set, through the category prediction model, at least one target word in the text data sample is determined, and the target word category corresponding to each of the at least one target word is determined;

根据所述目标词语类别与所述设定类别,确定相应的损失值,并根据所述损失值,对所述类别预测模型进行参数调整。According to the target word category and the set category, a corresponding loss value is determined, and according to the loss value, the parameters of the category prediction model are adjusted.
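The iterative training loop (sample, predict, compute a loss against the set category, adjust parameters, stop on a convergence condition) can be sketched as below. All of `model`, `compute_loss`, and `update` are placeholders for internals the patent leaves unspecified, and the loss-below-tolerance stopping rule is one possible "set convergence condition".

```python
import random

def train(model, dataset, compute_loss, update, max_steps=1000, tol=1e-4):
    """One possible realization of the iterative scheme: sample a text data
    sample, predict its target word categories, compute the loss against the
    annotated set category, and adjust parameters until convergence."""
    for step in range(max_steps):
        sample = random.choice(dataset)            # extract a text data sample
        predicted = model(sample["text"])          # predicted target categories
        loss = compute_loss(predicted, sample["label"])
        if loss < tol:                             # set convergence condition
            return step
        update(loss)                               # parameter adjustment
    return max_steps
```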

可选的,所述类别预测模型包括预训练语言子模型和命名实体识别子模型;所述训练数据集包括百科文本数据样本、领域文本数据样本和命名实体识别文本数据样本;所述模型训练模块还用于:Optionally, the category prediction model includes a pre-trained language sub-model and a named entity recognition sub-model; the training data set includes encyclopedia text data samples, domain text data samples and named entity recognition text data samples; the model training module Also used for:

基于从所述百科文本数据样本和领域文本数据样本中,抽取的文本数据样本,通过所述预训练语言子模型,确定相应的嵌入向量样本;所述百科文本数据样本为无针对性领域的文本数据,每个领域文本数据样本为包括多个设定类别命名实体的文本数据,且标注有对应的领域类别;Based on text data samples extracted from the encyclopedia text data samples and the domain text data samples, determining corresponding embedding vector samples through the pre-trained language sub-model; the encyclopedia text data samples are text data of no specific domain, and each domain text data sample is text data that includes multiple named entities of set categories and is annotated with the corresponding domain category;

基于从所述嵌入向量样本和词语向量样本中抽取的向量样本,通过所述命名实体识别子模型,确定相应的至少一个目标词语及其各自对应的目标词语类别;所述词语向量样本是基于所述命名实体识别文本数据样本得到的;所述命名实体识别文本数据样本为包括至少一个命名实体的文本数据,且每个命名实体词语标注有对应的命名实体类别。Based on vector samples extracted from the embedding vector samples and word vector samples, determining, through the named entity recognition sub-model, at least one corresponding target word and its corresponding target word category; the word vector samples are obtained from the named entity recognition text data samples; a named entity recognition text data sample is text data that includes at least one named entity, with each named-entity word annotated with its corresponding named entity category.

可选的,所述模型训练模块还用于:Optionally, the model training module is also used for:

根据所述目标词语类别与所述领域类别,确定第一损失值,并根据所述第一损失值对所述预训练语言子模型进行参数调整;Determine a first loss value according to the target word category and the domain category, and adjust the parameters of the pre-trained language sub-model according to the first loss value;

根据所述目标词语类别与所述领域类别、所述命名实体类别,确定第二损失值,并根据所述第二损失值对所述命名实体识别子模型进行参数调整。According to the target word category, the domain category, and the named entity category, a second loss value is determined, and parameters of the named entity recognition sub-model are adjusted according to the second loss value.
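One plausible reading of the two-loss scheme, sketched with a placeholder `loss_fn`: the first loss compares the prediction with the domain label and tunes the pre-trained language sub-model; the second also incorporates the named entity labels and tunes the NER sub-model. The additive combination below is an assumption for illustration, not something the patent states.

```python
def two_stage_losses(pred_domain, domain_label, pred_entities, entity_labels,
                     loss_fn):
    """First loss: domain-category error (tunes the language sub-model).
    Second loss: domain error plus named-entity-category error
    (tunes the NER sub-model). The sum is an illustrative assumption."""
    first = loss_fn(pred_domain, domain_label)
    second = first + loss_fn(pred_entities, entity_labels)
    return first, second
```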

可选的,所述模型训练模块还用于:Optionally, the model training module is also used for:

对所述训练数据集中的百科文本数据样本、领域文本数据样本和命名实体识别文本数据样本,分别进行预处理操作;所述预处理操作包括数据筛选和格式转换中的至少一种。Preprocessing operations are respectively performed on the encyclopedia text data samples, the domain text data samples and the named entity recognition text data samples in the training data set; the preprocessing operations include at least one of data screening and format conversion.

可选的,所述词典构造模块,具体用于:Optionally, the dictionary construction module is specifically used for:

基于一个目标词语分别属于所述多个词语类别的概率值,确定出最大概率值对应的词语类别,并将所述词语类别作为目标类别;Based on the probability values that a target word belongs to the multiple word categories respectively, determine the word category corresponding to the maximum probability value, and use the word category as the target category;

若所述一个目标词语属于所述目标类别的概率值大于第一设定阈值,则将所述一个目标词语归属至所述目标类别中。If the probability value of the one target word belonging to the target category is greater than the first set threshold, the one target word is assigned to the target category.
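The first assignment rule (take the maximum-probability category as the target category, and accept it only if its probability exceeds the first set threshold) can be sketched as follows. The 0.8 default threshold is illustrative; the patent does not fix a value.

```python
def assign_by_max_prob(probs, first_threshold=0.8):
    """Return the max-probability category if it clears the first set
    threshold, else None (meaning the fallback rule should be consulted)."""
    target = max(probs, key=probs.get)
    return target if probs[target] > first_threshold else None
```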

可选的,所述词典构造模块,还用于:Optionally, the dictionary construction module is further used for:

若所述一个目标词语分别属于所述多个词语类别的概率值不大于第二设定阈值,则选取出概率值大于第三设定阈值的至少一个词语类别,作为候选类别;所述第三设定阈值小于所述第二设定阈值;If none of the probability values of the target word belonging to the multiple word categories is greater than a second set threshold, selecting at least one word category whose probability value is greater than a third set threshold as a candidate category, the third set threshold being smaller than the second set threshold;

基于所述一个目标词语分别与所述至少一个候选类别各自包含的词语的相似度,选取出符合设定相似度条件的候选类别,作为目标类别,并将所述一个目标词语归属至所述目标类别中。Based on the similarity between the target word and the words contained in each of the at least one candidate category, selecting a candidate category that meets a set similarity condition as the target category, and assigning the target word to the target category.
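The fallback rule can be sketched as below: when no category probability clears the second threshold, categories above the (smaller) third threshold become candidates, and the candidate whose member words are most similar to the target word wins. Cosine similarity over word vectors and the concrete threshold values are assumptions for illustration; the patent only requires "a set similarity condition".

```python
import math

def assign_by_similarity(word_vec, probs, category_vecs,
                         second_threshold=0.8, third_threshold=0.3):
    """Fallback category assignment for a target word.

    `category_vecs` maps each category to the vectors of its member words
    (a hypothetical representation of the base dictionary's contents).
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    if max(probs.values()) > second_threshold:
        return None  # handled by the max-probability rule instead
    candidates = [c for c, p in probs.items() if p > third_threshold]
    if not candidates:
        return None
    # pick the candidate category containing the most similar member word
    return max(candidates,
               key=lambda c: max(cosine(word_vec, v) for v in category_vecs[c]))
```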

本申请实施例提供的一种电子设备,其包括处理器和存储器,其中,所述存储器存储有程序代码,当所述程序代码被所述处理器执行时,使得所述处理器执行上述任意一种词典构造方法的步骤。An electronic device provided by an embodiment of the present application includes a processor and a memory, wherein the memory stores program code, and when the program code is executed by the processor, the processor is caused to execute the steps of any one of the above dictionary construction methods.

本申请实施例提供的一种计算机可读存储介质,其包括程序代码,当所述程序代码在电子设备上运行时,所述程序代码用于使所述电子设备执行上述任意一种词典构造方法的步骤。A computer-readable storage medium provided by an embodiment of the present application includes program code; when the program code runs on an electronic device, the program code causes the electronic device to execute the steps of any one of the above dictionary construction methods.

本申请实施例提供的一种计算机程序产品,包括计算机程序/指令,当其在计算机上运行时,使得计算机执行上述词典构造方法。A computer program product provided by an embodiment of the present application includes a computer program/instruction, which, when running on a computer, causes the computer to execute the above dictionary construction method.

本申请有益效果如下:The beneficial effects of this application are as follows:

本申请实施例提供了词典构造方法、装置、电子设备及存储介质,在获取到待处理文本和包含有多个词语类别的基础词典后,可以基于已训练的类别预测模型,从待处理文本中选取出语义信息属于命名实体的至少一个目标词语,以及,确定出至少一个目标词语各自分别属于多个词语类别的概率值,并将至少一个目标词语,分别归属至对应的概率值符合设定概率条件的词语类别中。与相关技术中的基于N-Gram的词典构建方法相比,由于本方案可以基于类别预测模型,根据词语的语义信息准确地确定出待处理文本中属于命名实体的目标词语,并且根据目标词语属于基础词典中的命名实体类别的概率值,确定出目标词语所归属的目标命名实体类别,因此可以提高词典构造的准确率。The embodiments of the present application provide a dictionary construction method, device, electronic device, and storage medium. After the text to be processed and a basic dictionary containing multiple word categories are obtained, at least one target word whose semantic information indicates a named entity can be selected from the text to be processed based on a trained category prediction model; the probability values of each target word belonging to each of the multiple word categories are determined, and each target word is assigned to the word category whose probability value meets a set probability condition. Compared with the N-Gram-based dictionary construction methods in the related art, this scheme uses the category prediction model to accurately determine, from the semantic information of the words, the target words in the text to be processed that are named entities, and determines the target named entity category of each target word from its probability values over the named entity categories in the basic dictionary; the accuracy of dictionary construction can therefore be improved.

本申请的其它特征和优点将在随后的说明书中阐述,并且,部分地从说明书中变得显而易见,或者通过实施本申请而了解。本申请的目的和其他优点可通过在所写的说明书、权利要求书、以及附图中所特别指出的结构来实现和获得。Other features and advantages of the present application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description, claims, and drawings.

附图说明Description of drawings

图1a为本申请实施例中的一种应用场景示意图;FIG. 1a is a schematic diagram of an application scenario in an embodiment of the present application;

图1b为本申请实施例中的另一种应用场景示意图;FIG. 1b is a schematic diagram of another application scenario in the embodiment of the present application;

图2a为本申请实施例中对类别预测模型进行训练的流程示意图;2a is a schematic flowchart of training a category prediction model in an embodiment of the present application;

图2b为本申请实施例中对命名实体识别文本数据样本进行标注的示意图;2b is a schematic diagram of annotating a named entity recognition text data sample in an embodiment of the present application;

图2c为本申请实施例中嵌入向量样本的输出示意图;FIG. 2c is a schematic diagram of the output of embedded vector samples in an embodiment of the present application;

图2d为本申请实施例中的一种对预训练语言子模型进行训练的示意图;2d is a schematic diagram of training a pre-trained language sub-model according to an embodiment of the application;

图2e为本申请实施例中的另一种对预训练语言子模型进行训练的示意图;2e is another schematic diagram of training the pre-trained language sub-model in the embodiment of the application;

图2f为本申请实施例中对命名实体识别子模型进行训练的示意图;2f is a schematic diagram of training a named entity recognition sub-model in an embodiment of the present application;

图3为本申请实施例中对训练数据集进行分析和处理的流程示意图;3 is a schematic flowchart of analyzing and processing a training data set in the embodiment of the present application;

图4a为本申请实施例中词典构造方法的流程示意图;4a is a schematic flowchart of a dictionary construction method in an embodiment of the application;

图4b为本申请实施例中确定候选词语的示意图;4b is a schematic diagram of determining candidate words in an embodiment of the application;

图4c为本申请实施例中确定类别预测模型输出结果的示意图;FIG. 4c is a schematic diagram of determining the output result of the category prediction model in the embodiment of the present application;

图4d为本申请实施例中的一种确定目标类别的流程示意图;FIG. 4d is a schematic flowchart of determining a target category in an embodiment of the present application;

图5为本申请实施例中的另一种确定目标类别的流程示意图;FIG. 5 is another schematic flowchart of determining a target category in an embodiment of the present application;

图6为本申请实施例中词典构造方法的示意图;6 is a schematic diagram of a dictionary construction method in an embodiment of the application;

图7为本申请实施例中的一种词典构造装置的结构示意图;FIG. 7 is a schematic structural diagram of a dictionary construction device in an embodiment of the application;

图8为本申请实施例中的另一种词典构造装置的结构示意图;8 is a schematic structural diagram of another dictionary construction device in an embodiment of the application;

图9为本申请实施例的一种电子设备的一个硬件组成结构示意图;9 is a schematic structural diagram of a hardware composition of an electronic device according to an embodiment of the application;

图10为本申请实施例中的一个计算装置的结构示意图。FIG. 10 is a schematic structural diagram of a computing device in an embodiment of the present application.

具体实施方式Detailed ways

为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请技术方案的一部分实施例,而不是全部的实施例。基于本申请文件中记载的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请技术方案保护的范围。In order to make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the technical solutions of the present application. All other embodiments obtained by persons of ordinary skill in the art without creative work, based on the embodiments recorded in this application, fall within the protection scope of the technical solutions of the present application.

本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的本发明的实施例能够在除了这里图示或描述的那些以外的顺序实施。The terms "first", "second" and the like in the description and claims of the present application and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used may be interchanged under appropriate circumstances such that the embodiments of the invention described herein can be practiced in sequences other than those illustrated or described herein.

以下对本申请实施例中的部分用语进行解释说明,以便于本领域技术人员理解。Some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.

词典:从文本中抽取预定义类别的词语来构建所需要的词典,在对某一文本进行信息识别时,可以采用构建好的词典对该文本进行识别,能够快速并准确地识别出文本中所包含的词语所属的类别。Dictionary: words of predefined categories are extracted from text to construct the required dictionary. When identifying information in a text, the constructed dictionary can be used to recognize that text, so that the categories of the words it contains can be identified quickly and accurately.

命名实体:文本中具有特定语义的实体名称,主要包括人名、地名、机构名、专有名词等。Named entities: entity names with specific semantics in the text, mainly including person names, place names, institution names, proper nouns, etc.

下文中所用的词语“示例性”的意思为“用作例子、实施例或说明性”。作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。As used hereinafter, the word "exemplary" means "serving as an example, embodiment, or illustration." Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

文中的术语“第一”、“第二”仅用于描述目的,而不能理解为明示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征,在本申请实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。The terms "first" and "second" in the text are used only for the purpose of description and should not be construed as expressing or implying relative importance, or implicitly indicating the number of the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, unless otherwise specified, "multiple" means two or more.

本申请实施例涉及人工智能(Artificial Intelligence,AI)和机器学习(Machine Learning,ML)技术和自然语言处理(Natural Language Processing,NLP),基于人工智能中的机器学习技术和自然语言处理技术而设计。The embodiments of the present application relate to artificial intelligence (AI), machine learning (ML) and natural language processing (NLP) technologies, and are designed based on the machine learning and natural language processing techniques within artificial intelligence.

人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。Artificial Intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.

人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习、自动驾驶、智慧交通等几大方向。Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology. The basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning, autonomous driving, and smart transportation.

机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、式教学习等技术。Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in how computers simulate or realize human learning behaviors to acquire new knowledge or skills, and to reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications are in all fields of artificial intelligence. Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.

自然语言处理(Natural Language Processing,NLP)是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。自然语言处理是一门融语言学、计算机科学、数学于一体的科学。因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,所以它与语言学的研究有着密切的联系。自然语言处理技术通常包括文本处理、语义理解、机器翻译、机器人问答、知识图谱等技术。Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, the language people use daily, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, question answering, knowledge graphs and other technologies.

本申请实施例采用基于机器学习的类别预测模型,确定出文本中的至少一个目标词语,并确定至少一个目标词语各自分别属于基础词典中的多个词语类别的概率值。The embodiment of the present application adopts a category prediction model based on machine learning to determine at least one target word in the text, and determine the probability values that each of the at least one target word belongs to multiple word categories in the basic dictionary.

下面对本申请实施例的设计思想进行简要介绍:The design ideas of the embodiments of the present application are briefly introduced below:

词典可以用于对文本进行信息识别,以快速并准确地确定出文本中的词语所属于的预定义类别。相关技术中,常采用基于N-Gram的方法进行词典构造,该方法首先基于N-Gram算法对文本进行切分,得到N个词语,然后使用多个现有的词典进行对比,从N个词语中选择出新词加入至词典中。然而,该方法采用的N-Gram算法无法准确对文本中的词语进行切分,导致最终构造的词典的准确率也不高。Dictionaries can be used for information recognition on text, to quickly and accurately determine the predefined categories to which the words in the text belong. In the related art, N-Gram-based methods are often used for dictionary construction: the text is first segmented with the N-Gram algorithm to obtain N words, which are then compared against multiple existing dictionaries, and new words are selected from the N words and added to the dictionary. However, the N-Gram algorithm used in this method cannot segment the words in the text accurately, so the accuracy of the finally constructed dictionary is also low.

有鉴于此,本申请实施例提供一种词典构造方法、装置、电子设备及存储介质,可以先基于已训练的类别预测模型,根据文本中词语的语义信息,从待处理文本中选取出语义信息属于命名实体的目标词语,再根据目标词语属于词典中的各个命名实体类别的概率值,将目标词语归属至词典中的目标命名实体类别中,完成对词典的构造,从而可以提高词典构造的准确率。In view of this, the embodiments of the present application provide a dictionary construction method, device, electronic device, and storage medium. Based on a trained category prediction model and the semantic information of the words in the text, target words whose semantic information indicates a named entity are first selected from the text to be processed; then, according to the probability values of each target word belonging to the named entity categories in the dictionary, the target word is assigned to the target named entity category in the dictionary, completing the construction of the dictionary and thereby improving the accuracy of dictionary construction.

The preferred embodiments of the present application are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are only intended to illustrate and explain the present application, not to limit it, and that the embodiments and the features in the embodiments may be combined with one another provided they do not conflict.

Referring to FIG. 1a, which is a schematic diagram of an application scenario in an embodiment of the present application, the scenario includes a terminal device 100 and a server 200. The terminal device 100 and the server 200 may communicate through a communication network. Optionally, the communication network may be a wired network or a wireless network. The terminal device 100 and the server 200 may be connected directly or indirectly through wired or wireless communication, which is not limited in the present application.

In this embodiment of the present application, the terminal device 100 is an electronic device used by a user, which may be a personal computer, mobile phone, tablet computer, notebook, e-book reader, smart home appliance, in-vehicle terminal, or similar device. The server 200 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.

Illustratively, when a user browses text on the terminal device 100, the terminal device 100 may send the browsed text to the server 200. Based on the trained category prediction model, the server 200 may determine at least one candidate word contained in the text together with the semantic information of each candidate word, select from the candidate words, according to their semantic information, at least one target word that satisfies a set semantic condition, and determine the probability values that each target word belongs to each of the multiple word categories in the base dictionary. After the probability values are determined, the following operations may be performed for each target word: based on the probability values that the target word belongs to the multiple word categories, select the word category that satisfies a set probability condition as the corresponding target category, and assign the target word to that target category, thereby completing the construction and expansion of the dictionary.
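The last step, selecting a target category that satisfies a set probability condition, can be sketched as follows. The threshold value, category names, and probabilities are illustrative assumptions, not taken from the embodiment.

```python
def assign_category(prob_by_category, min_prob=0.5):
    """Pick the word category with the highest probability value,
    provided it satisfies the set probability condition (here: a
    minimum threshold). Returns None when no category qualifies."""
    category, prob = max(prob_by_category.items(), key=lambda kv: kv[1])
    return category if prob >= min_prob else None

# Hypothetical probability values for one target word
probs = {"Location": 0.08, "Organization": 0.87, "Person": 0.05}
assert assign_category(probs) == "Organization"
assert assign_category({"Location": 0.3, "Person": 0.3}) is None
```

A real system might use a more elaborate condition (e.g. a margin over the second-best category), but the threshold form matches the "set probability condition" wording above.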

In addition, after determining the target words contained in the text and the word categories to which they belong, the server 200 may also send the result to the terminal device 100, so that the terminal device 100 can show the user the categories of the words contained in the text, achieving information recognition on the text.

It should be noted that FIG. 1a is only an example of an application scenario of the dictionary construction method of the present application; the application scenarios to which the method in the embodiments of the present application can actually be applied are not limited to it.

In some embodiments, the application scenario may also be as shown in FIG. 1b, including the terminal device 100 and the server 200. The server 200 contains a dictionary, and this dictionary is one constructed by the dictionary construction method of the present application.

Specifically, the terminal device 100 may send the text to be recognized to the server 200. After the text is received, the dictionary in the server 200 can be used to recognize the named entities in it, and the resulting recognition result is sent back to the terminal device 100.

For example, suppose the text to be recognized is "Xiaohua climbs Mount Tai at 5 a.m. on the Mid-Autumn Festival to watch the sunrise". After the terminal device 100 sends this text to the server 200, recognizing it with the dictionary in the server 200 identifies the person name "Xiaohua", the festival "Mid-Autumn Festival", the time "5 a.m.", and the place name "Mount Tai", i.e., "person name: Xiaohua, festival: Mid-Autumn Festival, time: 5 a.m., place name: Mount Tai". After obtaining the recognition result, the server 200 may send it to the terminal device 100.
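A minimal sketch of dictionary-based recognition for this example is shown below. The lookup strategy (plain substring matching) and the dictionary contents are assumptions for illustration; the patent does not specify how the constructed dictionary is queried.

```python
def recognize(text, dictionary):
    """Report every dictionary entry that occurs in the text,
    together with its named entity category (substring lookup)."""
    return [(category, word) for word, category in dictionary.items()
            if word in text]

# Hypothetical dictionary entries mirroring the example above
dictionary = {
    "Xiaohua": "person name",
    "Mid-Autumn Festival": "festival",
    "5 a.m.": "time",
    "Mount Tai": "place name",
}
text = "Xiaohua climbs Mount Tai at 5 a.m. on the Mid-Autumn Festival to watch the sunrise"
assert ("place name", "Mount Tai") in recognize(text, dictionary)
assert ("festival", "Mid-Autumn Festival") in recognize(text, dictionary)
```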

Further, after receiving the recognition result, the terminal device 100 may analyze and manage the text to be recognized according to the named entities in the result. For example, if the text to be recognized is a news message, the named entities in it can be identified through the dictionary and the important information in the message can be extracted, so that the message can be analyzed on the basis of that information and determined, for instance, to be fake news or to contain sensitive words.

First, the training process of the category prediction model in the embodiment of the present application is described in detail. This process may be executed by a server, such as the server 200 in FIG. 1a. FIG. 2a is a schematic diagram of the training process of the category prediction model; the process is elaborated below with reference to FIG. 2a.

Step S201: obtain a training data set.

The obtained training data set may include multiple text data samples. Specifically, it may include encyclopedia text data samples, domain text data samples, and named entity recognition text data samples.

Among them, encyclopedia text data samples are text data from no particular domain. For example, the text "Shipping company Evergreen Marine leased container ships longer grant round in the Egyptian Suez Canal stranded re-float rescue 6 days, the canal restore traffic." and the text "Hong Kong veteran actor Liao Qizhi, who won the Hong Kong Film Awards for Best Supporting Actor, died in Prince of Wales Hospital at the age of 66." can both serve as encyclopedia text data. When encyclopedia text data is used as sample data to train the category prediction model, the encyclopedia text data samples do not need to be annotated.

Domain text data samples are text data that include multiple named entities of a set category, i.e., text data highly related to a specific type of named entity or containing many instances of one. For example, the text "After the Danish government sold the Caribbean colony of the West Indies to the United States, the United States began to control the U.S. Virgin Islands." can be a domain text data sample, specifically a location-domain sample. When domain text data is used as sample data to train the category prediction model, each domain text data sample needs to be annotated with the corresponding domain category. For example, since the above sample contains many location named entities, it can be labeled "Location".

Named entity recognition text data samples are text data that include at least one named entity, i.e., text data that require token-level named entity annotation. For example, the text "A cruise ship carrying tourists on Qiandao Lake in Zhejiang Province in southeastern China." can serve as a named entity recognition text data sample. When named entity recognition text data is used as sample data to train the category prediction model, each named entity word in each sample needs to be annotated with the corresponding named entity category. For each sample, the BIOES tagging scheme can be used to mark the named entity category and relative position of each word it contains. For example, as shown in FIG. 2b, tagging the sample "A ship carrying tourists on Qiandao Lake in Zhejiang Province" with the BIOES scheme yields "O O O S-LOC O O B-LOC E-LOC O". Here, B marks the beginning position of an entity, I a middle position, E the end position, and S a single-word named entity; the suffix indicates the entity category, e.g. "-LOC" means the named entity is a location; O marks a non-entity token.
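Decoding a BIOES tag sequence back into entity spans can be sketched as follows. The tag sequence in the example is hand-constructed for illustration and may differ from the one shown in FIG. 2b.

```python
def decode_bioes(tokens, tags):
    """Recover (entity text, category) spans from a BIOES tag sequence.
    B/I/E mark begin/inside/end of a multi-token entity, S a
    single-token entity, O a non-entity token; the suffix after '-'
    is the entity category (e.g. 'B-LOC')."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
        elif tag.startswith("S-"):
            spans.append((tokens[i], tag[2:]))
        elif tag.startswith("B-"):
            start = i
        elif tag.startswith("E-") and start is not None:
            spans.append((" ".join(tokens[start:i + 1]), tag[2:]))
            start = None
    return spans

tokens = "A ship carrying tourists on Qiandao Lake in Zhejiang Province".split()
tags = ["O", "O", "O", "O", "O", "B-LOC", "E-LOC", "O", "B-LOC", "E-LOC"]
assert decode_bioes(tokens, tags) == [("Qiandao Lake", "LOC"),
                                      ("Zhejiang Province", "LOC")]
```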

Step S202: extract text data samples from the encyclopedia text data samples and the domain text data samples.

The category prediction model to be trained includes a pre-trained language sub-model and a named entity recognition sub-model. When training the pre-trained language sub-model, text data samples can be drawn from the encyclopedia text data samples and domain text data samples in the training data set as training sample data.

Step S203: input the extracted text data samples into the pre-trained language sub-model to be trained to obtain corresponding embedding vector samples.

The extracted text data samples are input into the pre-trained language sub-model to be trained, and feature extraction is performed on them based on the sub-model to obtain corresponding embedding vector samples. An embedding vector sample may include the word vector corresponding to each word in the corresponding text data sample. For example, for the text data sample "After the Danish government sold the Caribbean colony", passing it through the pre-trained language sub-model yields the embedding vector samples shown in FIG. 2c, namely E1, E2, E3, ..., E8.

Specifically, as shown in FIG. 2d, before the extracted text data samples are input into the pre-trained language sub-model to be trained, each sample must also be processed into the [CLS] + tokens + [SEP] structure, i.e., into the form "[CLS], I1, I2, I3, ..., [SEP]". The special token [CLS] can be used for the domain category recognition task.

After a text data sample has been processed into the "[CLS], I1, I2, I3, ..., [SEP]" structure, random masking can also be applied to it, turning it into "[CLS], I1, [MASK], I3, ..., [SEP]". The random mask mechanism is applied to the samples input to the pre-trained language sub-model so that the sub-model can perform the mask prediction task: any character or word in a sentence is randomly covered or replaced, and the pre-trained language sub-model must then predict the covered or replaced part from its understanding of the context.
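A minimal sketch of the random mask mechanism is shown below. The masking probability is an illustrative assumption; the patent does not specify one, and BERT-style implementations typically also replace some selected tokens with random words rather than [MASK].

```python
import random

MASK, SPECIAL = "[MASK]", {"[CLS]", "[SEP]"}

def random_mask(tokens, mask_prob=0.15, seed=None):
    """Randomly replace ordinary tokens with [MASK]; the language
    sub-model is then trained to predict the original token from the
    surrounding context. Special tokens are never masked. Returns
    (masked tokens, targets), where targets maps each masked position
    to its original token."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in SPECIAL and rng.random() < mask_prob:
            targets[i] = tok
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked, targets

tokens = ["[CLS]", "Shipping", "company", "Green", "Marine", "leased", "[SEP]"]
masked, targets = random_mask(tokens, mask_prob=0.5, seed=0)
assert masked[0] == "[CLS]" and masked[-1] == "[SEP]"   # specials untouched
assert all(tokens[i] == t for i, t in targets.items())  # targets recoverable
```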

After random masking has been applied to a text data sample, the processed sample can be used to train the pre-trained language sub-model. First, the sample is embedded, i.e., vectorized, yielding "E[CLS], E1, E[MASK], E3, ..., E[SEP]"; the vectorized sample then passes through the multiple hidden layers of the pre-trained language sub-model, producing the output "C, T1, T[MASK], T3, ..., T[SEP]". Since the text data samples input into the pre-trained language sub-model also include domain text data samples annotated with domain categories, training the sub-model enables it to predict the domain category of an input text data sample. Training the pre-trained language sub-model on the two tasks of random masking and domain category recognition simultaneously allows it to learn more semantic information.

For example, an extracted text data sample may be "Shipping company Green Marine leased container ships longer grant round in the Egyptian Suez Canal stranded re-float rescue 6 days, the canal restore traffic.". As shown in FIG. 2e, before this sample is input into the pre-trained language sub-model, it can be processed into "[CLS] Shipping company Green Marine leased container ships longer grant round in the Egyptian Suez Canal stranded re-float rescue 6 days, the canal restore traffic. [SEP]", and random masking can then be applied to obtain "[CLS] Shipping [MASK] Green [MASK] leased container ... [SEP]".

After the randomly masked text data sample is input into the pre-trained language sub-model, the sub-model vectorizes it to obtain "E[CLS], E1, E2, E[MASK], E4, E[MASK], E6, E[...], E[SEP]", and then produces the output "C, T1, T2, T[MASK], T4, T[MASK], T6, T[...], T[SEP]".

Applying random masking to the extracted text data samples before inputting them into the pre-trained language sub-model for training enables the sub-model to predict the covered or replaced parts from its understanding of the context. That is, for the input "[CLS] Shipping [MASK] Green [MASK] leased container ... [SEP]", the pre-trained language sub-model can predict from context that the first "[MASK]" is "company" and the second "[MASK]" is "Marine".

Optionally, the pre-trained language sub-model in the embodiments of the present application can be implemented with Bidirectional Encoder Representations from Transformers (BERT).

In the embodiments of the present application, training the pre-trained language sub-model on encyclopedia text data and domain text data annotated with domain categories allows the sub-model to learn more semantic information, so that when named entity recognition is performed on text data, the semantic information in the text can be exploited to identify the target words that are named entities faster and better.

Step S204: extract vector samples from the embedding vector samples and from the word vector samples obtained from the named entity recognition text data samples.

In this embodiment, the named entity recognition text data samples in the training data set can be vectorized to obtain corresponding word vector samples. When vectorizing them, a word embedding method can be used to obtain the word vector samples; the embodiments of the present application do not limit the way in which word vector samples are obtained from the named entity recognition text data samples.

After the word vector samples corresponding to the named entity recognition text data samples are obtained, vector samples can be extracted from these word vector samples and from the embedding vector samples produced by the pre-trained language sub-model.

Step S205: input the extracted vector samples into the named entity recognition sub-model to obtain at least one corresponding target word and the target word category of each.

The extracted vector samples are input into the named entity recognition sub-model, which can identify at least one target word in the text data sample that is a named entity, as well as the named entity category corresponding to each target word.

Specifically, as shown in FIG. 2f, suppose the extracted text data sample is "[CLS] Shipping company Green Marine leased container ships longer ... [SEP]", annotated as "[CLS] O O B-ORG E-ORG O O O O ... [SEP]". Here, O marks a non-entity token, B the beginning position of a named entity, E its end position, and S a single-word named entity; the suffix indicates the entity category, with "-LOC" denoting the location category and "-ORG" the organization category.

Inputting the annotated text data sample into the pre-trained language sub-model yields the corresponding embedding vector samples "E[CLS], E1, E2, E3, E4, E5, E6, E7, E8, E..., E[SEP]".

The obtained embedding vector samples are input into the named entity recognition sub-model to train it. The named entity recognition sub-model consists of two sub-models: a boundary recognition sub-model (Border Segment Model), which identifies the named entity boundaries in the text data, and a category recognition sub-model (Named Entity Recognition Model, NER Model), which identifies the named entity category to which each named entity word in the text data belongs. After the two sub-models are trained, their prediction results are combined to produce the final prediction result.

After the embedding vector samples "E[CLS], E1, E2, E3, E4, E5, E6, E7, E8, E..., E[SEP]" are input into the named entity recognition sub-model, the boundary recognition sub-model within it identifies the named entity boundaries in the text data sample: recognizing the text "[CLS] Shipping company Green Marine leased container ships longer ... [SEP]" with the boundary recognition sub-model yields "O O B E O O O O ...". Then, based on the identified named entity boundaries and the embedding vector samples, the category recognition sub-model determines the named entity category of each named entity word: recognizing the same text with the category recognition sub-model yields "O O ORG ORG O O O O ... O". Combining the recognition results of the boundary recognition sub-model with those of the category recognition sub-model then gives the prediction result for the text "[CLS] Shipping company Green Marine leased container ships longer ... [SEP]", i.e., the named entity words in the text and the named entity categories to which they belong.
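The merging of the two sub-models' outputs can be sketched as follows. The tag vocabularies are assumptions based on the example above, and combining hard labels is a simplification; an implementation might instead combine the two models' score distributions.

```python
def combine(boundary_tags, category_tags):
    """Merge per-token boundary predictions (O/B/I/E/S) with per-token
    category predictions (e.g. O/ORG/LOC) into final BIOES tags such
    as 'B-ORG'. A token is an entity only if both models agree it is."""
    return ["O" if b == "O" or c == "O" else f"{b}-{c}"
            for b, c in zip(boundary_tags, category_tags)]

# Boundary and category outputs for "Shipping company Green Marine ..."
boundary = ["O", "O", "B", "E", "O", "O", "O", "O"]
category = ["O", "O", "ORG", "ORG", "O", "O", "O", "O"]
assert combine(boundary, category) == \
    ["O", "O", "B-ORG", "E-ORG", "O", "O", "O", "O"]
```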

Recognizing the text "[CLS] Shipping company Green Marine leased container ships longer ... [SEP]" with the named entity recognition sub-model gives the prediction result "[CLS] O O B-ORG E-ORG O O O O ... [SEP]". From this prediction it can be seen that "Shipping", "company", "leased", "container", "ships", and "longer" in the text are all non-entities, while "Green Marine" is a named entity, specifically an organization named entity.

In the embodiments of the present application, the named entity recognition sub-model is trained on the embedding vectors output by the pre-trained language sub-model and the word vectors obtained from named entity recognition text samples annotated with named entity categories. This makes it possible to exploit the sentence-level semantic information of the text data learned during the pre-training stage, and thus to identify the target words that are named entities in the text data more accurately.

Step S206: determine a first loss value according to the target word category and the domain category.

The first loss value can be determined from the target word category that the named entity recognition sub-model assigns to a target word, and the domain category annotated on the text data sample in which that target word appears. In general, a loss value measures how close the actual output is to the desired output: the smaller the loss value, the closer the actual output is to the desired output.

Step S207: determine whether the first loss value has converged to a preset target value; if not, perform step S208; if so, perform step S209.

Whether the first loss value has converged to the preset target value is judged as follows: if the first loss value is less than or equal to the preset target value, or if the variation of the first loss value over N consecutive training rounds is less than or equal to the preset target value, the first loss value is considered to have converged to the preset target value, i.e., it has converged; otherwise, it has not yet converged.
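The convergence test just described can be sketched as follows; the target value and window size N are illustrative, and the same check applies to the second loss value in step S210.

```python
def has_converged(losses, target=0.01, n=5):
    """Converged when the latest loss is at or below the preset target
    value, or when the variation of the loss over the last n training
    rounds is at or below the target value."""
    if not losses:
        return False
    if losses[-1] <= target:
        return True
    if len(losses) >= n:
        recent = losses[-n:]
        return max(recent) - min(recent) <= target
    return False

assert has_converged([0.8, 0.2, 0.009])                      # below target
assert has_converged([0.5, 0.501, 0.499, 0.5, 0.502], n=5)   # stable plateau
assert not has_converged([0.9, 0.5, 0.3])                    # still falling
```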

Step S208: adjust the parameters of the pre-trained language sub-model to be trained according to the determined first loss value.

If the first loss value has not converged, the model parameters of the pre-trained language sub-model are adjusted; after the adjustment, the process returns to step S202 and continues with the next round of training.

Step S209: determine a second loss value according to the target word category and the domain category or named entity category.

The second loss value can be determined from the target word category that the named entity recognition sub-model assigns to a target word, and the domain category or named entity category annotated on the text data sample in which that target word appears. In general, a loss value measures how close the actual output is to the desired output: the smaller the loss value, the closer the actual output is to the desired output.

In the embodiments of the present application, the first loss value, determined from the target word category output by the named entity recognition sub-model and the domain category annotated in the extracted text data samples, is used to adjust the model parameters of the pre-trained language sub-model; and the second loss value, determined from the target word category output by the named entity recognition sub-model and the domain category or named entity category annotated in the extracted vector samples, is used to adjust the model parameters of the named entity recognition sub-model. In this way, the training of the pre-trained language sub-model can be completed first, and once it has finished training, its weights no longer participate in the training of the named entity recognition sub-model, which reduces the training time of the model. This also enables the named entity recognition sub-model to better combine the two subtasks of boundary recognition and category prediction, identify the target words that are named entities in the text more accurately, and assign them to the corresponding named entity categories.
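The two-stage schedule described above can be sketched as the following skeleton. The callables `lm_step`, `ner_step`, and `freeze_lm` are hypothetical stand-ins for one training round of each sub-model and for freezing the pre-trained weights; the convergence rule (threshold plus an N-round plateau check) is the illustrative one from step S207.

```python
def converged(losses, target=0.01, n=5):
    # Converged when the latest loss <= target, or the last n values
    # vary by no more than the target value.
    return bool(losses) and (
        losses[-1] <= target
        or (len(losses) >= n and max(losses[-n:]) - min(losses[-n:]) <= target))

def train_category_prediction_model(lm_step, ner_step, freeze_lm,
                                    max_rounds=1000):
    """Stage 1 (steps S202-S208): train the pre-trained language
    sub-model until the first loss converges. Stage 2 (steps
    S204-S211): freeze it, then train the named entity recognition
    sub-model until the second loss converges."""
    losses = []
    while not converged(losses) and len(losses) < max_rounds:
        losses.append(lm_step())
    freeze_lm()  # pre-trained weights no longer updated
    losses = []
    while not converged(losses) and len(losses) < max_rounds:
        losses.append(ner_step())
```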

Step S210: determine whether the second loss value has converged to the preset target value; if not, perform step S211; if so, perform step S212.

Whether the second loss value has converged to the preset target value is judged as follows: if the second loss value is less than or equal to the preset target value, or if the variation of the second loss value over N consecutive training rounds is less than or equal to the preset target value, the second loss value is considered to have converged to the preset target value, i.e., it has converged; otherwise, it has not yet converged.

Step S211: adjust the parameters of the named entity recognition sub-model to be trained according to the determined second loss value.

If the second loss value has not converged, the model parameters of the named entity recognition sub-model are adjusted; after the adjustment, the process returns to step S204 to continue with the next round of training.

Step S212: end the training to obtain the trained category prediction model.

If both the first loss value and the second loss value have converged, the currently obtained category prediction model is taken as the trained category prediction model.

In the embodiment of the present application, a training data set comprising encyclopedia text data samples, domain text data samples annotated with domain categories, and named entity recognition text data samples annotated with named entity categories is used to train the pre-trained language sub-model and the named entity recognition sub-model in the category prediction model. This allows the model to learn more of the semantic information in the text data and, based on that semantic information, to identify the named entities in the text data more accurately and to assign them to the corresponding named entity categories. In addition, after the training of the category prediction model is completed, a basic dictionary can be constructed from the existing training data set, thereby completing the construction of the dictionary.

After the trained category prediction model is obtained, a basic dictionary can be constructed based on the training data set comprising the encyclopedia text data samples, the domain text data samples and the named entity recognition text data samples. The basic dictionary contains multiple named entity categories, and each named entity category may contain multiple named entity words.

Specifically, after the embedding vector samples corresponding to the text data samples are obtained through the pre-trained language sub-model, the embedding vector samples are input into the named entity recognition sub-model. During the training of this model, the named entity recognition task of the named entity recognition sub-model mainly comprises two sub-tasks: one is the recognition of named entity boundaries, and the other is the recognition of named entity categories. The BIO/BIOES tagging scheme can tag both the boundary and the category of a word at the same time. Based on this scheme, previous NER models treat the NER problem as a multi-label classification task and directly predict the named entity position and category of each word. This approach combines the two sub-tasks of boundary recognition and category prediction fairly well, but two main problems remain. First, such models perform well when there are few named entity categories, but have difficulty accurately predicting the entity category of each word when there are many categories. Second, only the semantic information of a single word can be used during recognition; for a named entity composed of multiple words, the semantic information of the words as a whole cannot be exploited.
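As an illustration of the BIO scheme mentioned above, the sketch below tags the example sentence from FIG. 4b character by character and recovers entity spans from the tags. The character-level tokenization and the PER/LOC category labels are assumptions for illustration, not labels defined by this application:

```python
# BIO tagging marks both boundary and category per token:
# B-X = beginning of an entity of category X, I-X = inside it, O = outside.
tokens = ["小", "明", "到", "杭", "州", "游", "玩"]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]

def spans_from_bio(tokens, tags):
    """Recover (entity_text, category) spans from a BIO sequence."""
    spans, cur, cat = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if cur:                       # close the previous entity
                spans.append(("".join(cur), cat))
            cur, cat = [tok], tag[2:]
        elif tag.startswith("I-") and cur:
            cur.append(tok)               # continue the current entity
        else:
            if cur:
                spans.append(("".join(cur), cat))
            cur, cat = [], None
    if cur:                               # entity running to end of text
        spans.append(("".join(cur), cat))
    return spans
```

Applied to the tags above, this yields the two entities 小明 (PER) and 杭州 (LOC).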

The named entity recognition sub-model in the embodiment of the present application is likewise trained on the two sub-tasks, but combines them differently from previous models. First, the boundary annotation information is used to train the model to recognize the boundaries of named entities, and the text is divided into different semantic units according to the prediction results. Words within the same unit are regarded as belonging to the same named entity; their word vectors are averaged and the average is then assigned to each word, so that the semantic information of the named entity as a whole is exploited.
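The averaging of word vectors within a predicted semantic unit can be sketched as follows; the array shapes and the exclusive-end span convention are assumptions for illustration:

```python
import numpy as np

def average_within_spans(vectors, boundaries):
    """Assign each token the mean vector of its semantic unit.

    vectors:    (n_tokens, dim) array of per-token word vectors.
    boundaries: list of (start, end) index pairs, one per predicted
                semantic unit, end exclusive; tokens outside any unit
                keep their own vectors.
    """
    out = vectors.copy()
    for start, end in boundaries:
        # Every token in the unit receives the unit's mean vector.
        out[start:end] = vectors[start:end].mean(axis=0)
    return out
```

This mirrors the "E1+E2 E1+E2" notation of FIG. 4b: both tokens of a two-token entity carry the same averaged vector.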

The pre-trained language sub-model and the named entity recognition sub-model can be trained in two ways. One is the fine-tune approach, in which all parameters of both models are adjusted jointly. The other is the feature-based approach, in which the output of the pre-trained language sub-model is used directly as the embedding vector and fed into the named entity recognition sub-model, the parameters of the pre-trained language sub-model are kept fixed, and only the named entity recognition sub-model needs to be trained. The training method adopted in the embodiment of the present application is the feature-based approach: after the training of the pre-trained language sub-model is finished, its weights no longer participate in training, which reduces the model training time and allows the named entity recognition sub-model to better combine the two sub-tasks of boundary recognition and category prediction.
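A minimal numerical sketch of the feature-based approach: a fixed embedding table stands in for the pre-trained language sub-model and is never updated, while a linear scoring head stands in for the named entity recognition sub-model and alone receives gradient steps. All names, dimensions and the softmax head are illustrative assumptions, not the models of this application:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pre-trained language sub-model": features are looked up, never trained.
vocab_size, dim, n_categories = 100, 16, 3
frozen_embeddings = rng.normal(size=(vocab_size, dim))

def encode(token_ids):
    """Feature extraction with the frozen sub-model."""
    return frozen_embeddings[token_ids]

# Trainable "named entity recognition sub-model": a linear softmax head.
W = np.zeros((dim, n_categories))

def train_step(token_ids, labels, lr=0.1):
    """One cross-entropy gradient step on the head only; encoder stays fixed."""
    global W
    feats = encode(token_ids)                          # (n_tokens, dim)
    logits = feats @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(n_categories)[labels]
    W -= lr * feats.T @ (probs - onehot) / len(labels)
```

Because `frozen_embeddings` is never written to, the encoder's weights do not participate in training, which is exactly the time saving the feature-based approach provides over joint fine-tuning.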

In one embodiment, after the training data set is acquired, the multiple text data samples in the training data set may further be analyzed and processed. FIG. 3 shows the specific process of analyzing and processing the training data in the training data set in the embodiment of the present application. As shown in FIG. 3, the process may include the following steps:

Step S301: acquire a training data set including encyclopedia text data, domain text data and named entity recognition text data.

Here, the encyclopedia text data is text data from no specific domain; the domain text data is selected text data that is highly related to a particular named entity category or contains many named entities of a certain category; and the named entity recognition text data is text data that requires word-level named entity annotation.

Step S302: perform preprocessing operations on the encyclopedia text data, the domain text data and the named entity recognition text data respectively.

After the encyclopedia text data, the domain text data and the named entity recognition text data are acquired, preprocessing operations can be performed on each of them. The preprocessing operations may include at least one of data filtering and format conversion. Specifically, after the encyclopedia text data samples, the domain text data samples and the named entity recognition text data samples are acquired, data cleaning, denoising and serialization operations can be performed on each text data sample. Data cleaning may include missing-value cleaning, format and content cleaning, logic-error cleaning, removal of unneeded data, and correlation verification; data denoising may include outlier filling, distance-based detection, and the like; data serialization converts the data into a standard format.
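A hypothetical sketch of the cleaning/denoising/serialization pipeline. The concrete rules (dropping missing values, collapsing whitespace, a minimum-length filter, JSON records as the standard format) are illustrative assumptions, not the rules used in this application:

```python
import json
import re

def preprocess(raw_texts):
    """Clean, denoise and serialize a list of raw text samples."""
    cleaned = []
    for text in raw_texts:
        if text is None:                          # missing-value cleaning
            continue
        text = re.sub(r"\s+", " ", text).strip()  # format/content cleaning
        if len(text) < 2:                         # drop noisy fragments
            continue
        cleaned.append(text)
    # Serialization: one standard-format record per sample.
    return [json.dumps({"text": t}, ensure_ascii=False) for t in cleaned]
```

The output records would then serve as the text sequences obtained in step S303.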

Step S303: obtain the corresponding encyclopedia text sequence, domain text sequence and named entity recognition text sequence.

After the preprocessing operations are performed on the encyclopedia text data, the domain text data and the named entity recognition text data respectively, the corresponding encyclopedia text sequence, domain text sequence and named entity recognition text sequence can be obtained.

In the embodiment of the present application, performing the data cleaning operation on the text data in the training data set yields data that meets the data quality requirements; performing the denoising and serialization operations then converts the text data into a standard format. The category prediction model can thus be trained on the standard-format text data obtained after processing, which improves both the training efficiency and the training results of the model.

After the trained category prediction model is obtained, dictionary construction can be performed based on the category prediction model. FIG. 4a shows a flowchart of a dictionary construction method provided by an embodiment of the present application; the method can be executed by a server, for example, the server 200 in FIG. 1a. The dictionary construction process in the embodiment of the present application is described in detail below with reference to FIG. 4a.

Step S40: acquire the text to be processed and the basic dictionary.

Here, the basic dictionary is a dictionary, obtained during the training phase of the category prediction model, that includes multiple named entity categories, and each named entity category in the basic dictionary contains multiple corresponding named entity words. For example, the basic dictionary may include named entity categories such as person names, place names and organization names; the person-name category may include multiple names, the place-name category multiple locations, and the organization-name category multiple organizations.

In one embodiment, after the text to be processed is acquired, data cleaning, denoising and serialization operations may further be performed on it to obtain the text sequence corresponding to the text to be processed.

Step S41: based on the trained category prediction model, determine at least one candidate word contained in the text to be processed, and the semantic information of each of the at least one candidate word.

Here, the category prediction model may include a pre-trained language sub-model and a named entity recognition sub-model.

The text to be processed is input into the category prediction model, and based on the pre-trained language sub-model in the category prediction model, the word vector corresponding to each of at least one word contained in the text to be processed can be obtained, where each word vector represents the semantic information of the corresponding word.

After the word vectors corresponding to the at least one word are obtained, based on the named entity recognition sub-model in the category prediction model, the at least one word can be combined to obtain at least one candidate word and the semantic information of each candidate word, where each candidate word may contain at least one word.

For example, as shown in FIG. 4b, the text to be processed is "小明到杭州游玩" ("Xiao Ming goes to Hangzhou to play"). Before this text is input into the category prediction model, it needs to be processed into the structure "[CLS]小明到杭州游玩[SEP]", which is then input into the category prediction model. Based on the pre-trained language sub-model, "E[CLS] E1 E2 E3 E4 E5 E6 E7 E[SEP]" is obtained. From "E[CLS] E1 E2 E3 E4 E5 E6 E7 E[SEP]", the named entity recognition sub-model produces "E[CLS] E1+E2 E1+E2 E3 E4+E5 E4+E5 E6 E7 E[SEP]".

In the embodiment of the present application, based on the pre-trained language sub-model and the named entity recognition sub-model, at least one candidate word contained in the text to be processed and the semantic information of each candidate word can be obtained, so that the text to be processed can be divided into multiple candidate words according to its semantic information. When named entity recognition is then performed, the semantic information of multiple words in the text as a whole can be exploited to accurately identify the target words in the text that belong to named entities.

Step S42: through the category prediction model, select at least one target word that satisfies a set semantic condition according to the semantic information of each of the at least one candidate word, and determine the probability values of each of the at least one target word belonging to each of multiple word categories.

After the at least one candidate word contained in the text to be processed and the semantic information of each candidate word are determined, the named entity recognition sub-model can select, from the at least one candidate word, at least one target word whose semantic information indicates a named entity, where a named entity is an entity name with specific semantics.

For each of the at least one target word, the following operation can be performed: through the named entity recognition sub-model, according to the semantic information of the target word, determine the probability values of the target word belonging to each of the multiple word categories.

For example, as shown in FIG. 4c, assume that the basic dictionary includes three named entity categories: person names, organization names and place names. After "E[CLS] E1+E2 E1+E2 E3 E4+E5 E4+E5 E6 E7 E[SEP]" is obtained from the text to be processed "小明到杭州游玩", it can be determined from the semantic information that the words corresponding to "E1+E2" and "E4+E5" are both target words whose semantic information indicates a named entity. The probabilities that "E1+E2" and "E4+E5" belong to a person name, an organization name and a place name are then predicted, giving the probability values P1, P2, P3 that "E1+E2" belongs to a person name, an organization name and a place name respectively, and the probability values P4, P5, P6 that "E4+E5" belongs to a person name, an organization name and a place name respectively.

In the embodiment of the present application, based on the named entity recognition sub-model, the target words in the text to be processed that belong to named entities can be identified, and the probability values of the target words belonging to each named entity category in the basic dictionary can be obtained, so that the named entities in the text to be processed are accurately identified together with their probability with respect to each named entity category.

Step S43: assign each of the at least one target word to the word category whose corresponding probability value satisfies a set probability condition.

After the probability values of the at least one target word belonging to the multiple word categories are determined, for each target word, the word category that satisfies the set probability condition can be selected, based on the probability values of that target word belonging to the multiple word categories, as the target category to which the target word belongs, and the target word is added to that target category in the basic dictionary. For example, after the probability values P1, P2, P3 that the named entity "小明" (Xiao Ming) in the text to be processed "小明到杭州游玩" belongs to a person name, an organization name and a place name, and the probability values P4, P5, P6 that "杭州" (Hangzhou) belongs to a person name, an organization name and a place name, are determined, assume that P1 and P6 satisfy the set probability condition. Then the person-name category can be taken as the named entity category of "小明" and the place-name category as the named entity category of "杭州"; "小明" is added to the person names of the basic dictionary, and "杭州" is added to its place names.

In the above step S43, for one of the at least one target word, the process of determining the target category to which the target word belongs and assigning the target word to that target category in the basic dictionary may be as shown in FIG. 4d and include the following steps:

Step S431: based on the probability values of the target word belonging to the multiple word categories, determine the word category corresponding to the maximum probability value, and take this word category as the target category.

Here, the target words are the words, output by the category prediction model, that are contained in the text to be processed and belong to named entities, and the word categories are the named entity categories contained in the basic dictionary. After the text to be processed is input into the category prediction model, the model can output the prediction result for at least one target word belonging to a named entity contained in the text, and this prediction result is the probability value of each target word belonging to each named entity category, forming (category, probability) result pairs.

After the probability values of each target word belonging to each named entity category are determined, one of the target words can be taken as an example to explain in detail how each target word is assigned to the corresponding named entity category in the basic dictionary.

The probability values of the target word belonging to each named entity category are arranged in descending order, that is, sorted from largest to smallest; the maximum probability value among them is determined, and the named entity category corresponding to the maximum probability value is taken as the target category.

Step S432: if the probability value of the target word belonging to the target category is greater than a first set threshold, assign the target word to the target category.

If the maximum probability value is greater than the first set threshold, the target word can be assigned to the target category, that is, to the target category in the basic dictionary, thereby completing the dictionary expansion.

For example, the target word is "杭州" (Hangzhou), and the named entity categories contained in the basic dictionary are person names, organization names and place names. The probability value of "杭州" belonging to a person name is 0.08, the probability value of it belonging to an organization name is 0.12, the probability value of it belonging to a place name is 0.98, and the first set threshold is 0.8. Arranging the probability values in descending order, the maximum probability value is determined to be 0.98, and the named entity category corresponding to it is the place name. Since the maximum probability value 0.98 is greater than the first set threshold 0.8, the target word "杭州" can be assigned to the place names of the basic dictionary.

In the embodiment of the present application, after the probability values of the target word belonging to each named entity category are obtained, if the maximum probability value exceeds the set threshold, the target word can be assigned to the named entity category corresponding to the maximum probability value, completing the dictionary expansion. In this way, the target words identified in the text as named entities are placed in the named entity category with the highest probability value, which improves the accuracy of dictionary construction.

Step S433: if none of the probability values of the target word belonging to the multiple word categories is greater than a second set threshold, select at least one word category whose probability value is greater than a third set threshold as a candidate category.

Here, the third set threshold is smaller than the second set threshold, and the second set threshold may or may not be equal to the first set threshold.

When the second set threshold is equal to the first set threshold, this is equivalent to: if the maximum of the probability values of the target word belonging to the multiple word categories is not greater than the first set threshold, select at least one word category whose probability value is greater than the third set threshold as a candidate category.

When none of the probability values of the target word belonging to the multiple word categories is greater than the second set threshold, the probability values of the target word belonging to each named entity category can be arranged in descending order, and at least one named entity category whose probability value is greater than the third set threshold can be selected as a candidate category.

Step S434: based on the similarity between the target word and the words contained in each of the at least one candidate category, select the candidate category that satisfies a set similarity condition as the target category, and assign the target word to the target category.

After the at least one candidate category is selected, the similarity between the target word and the words contained in each candidate category can be determined; the candidate category with the highest similarity is then selected as the target category, and the target word is assigned to that target category in the basic dictionary.

In the embodiment of the present application, after the probability values of the target word belonging to each named entity category are determined, multiple candidate categories can also be selected from the named entity categories according to the probability values; the target word is compared for similarity with the words contained in each candidate category, the target category is determined according to the similarity, and the target word is then assigned to the target category, completing the dictionary expansion. In this way, based on the similarity judgment, the target words identified in the text as named entities are placed in the target category corresponding to the words with the highest similarity to the target word, which improves the accuracy of dictionary construction.

In one embodiment, for each candidate category, the WordNet dictionary tool can be used to determine the similarity between the target word and the set of existing words contained in that candidate category. After the similarity between the target word and the existing word set of each candidate category is determined, the candidate category with the highest similarity is selected as the target category, and the target word is assigned to that target category in the basic dictionary, completing the dictionary expansion.
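The text does not specify which WordNet similarity measure is used. As a self-contained stand-in, the sketch below scores the target word against each candidate category by the maximum cosine similarity between its vector and the vectors of the category's existing words; substituting cosine similarity for the WordNet measure is an assumption for illustration only:

```python
import numpy as np

def best_category_by_similarity(target_vec, category_words):
    """Pick the candidate category whose existing words are most
    similar to the target word.

    target_vec:     vector of the target word.
    category_words: dict mapping category name -> list of vectors of
                    the words already in that category.
    Returns (category, similarity).
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_cat, best_sim = None, -1.0
    for cat, vecs in category_words.items():
        # Similarity to a category = best match against its word set.
        sim = max(cosine(target_vec, v) for v in vecs)
        if sim > best_sim:
            best_cat, best_sim = cat, sim
    return best_cat, best_sim
```

With a WordNet-based measure, `cosine` would be replaced by a synset-level score between the target word and each existing word.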

For example, the target word is "纽约" (New York), and the named entity categories contained in the basic dictionary are person names, organization names, place names, festivals, product names and times. The probability value of "纽约" belonging to a person name is 0.45, to an organization name 0.67, to a place name 0.8, to a festival 0.32, to a product name 0.76, and to a time 0.06; the second set threshold is 0.9 and the third set threshold is 0.5. Since none of the probability values of "纽约" belonging to a person name, an organization name, a place name, a festival, a product name or a time is greater than the second set threshold 0.9, the named entity categories whose probability values are greater than the third set threshold 0.5 can be selected as candidate categories, that is, the selected candidate categories are organization names, place names and product names. The similarity between the target word "纽约" and the existing word sets contained in the organization names, place names and product names is then determined. Assuming the similarity between "纽约" and the existing word set of the organization names is 0.06, the similarity with the existing word set of the place names is 0.96, and the similarity with the existing word set of the product names is 0.21, the place-name category can be taken as the target category corresponding to the target word "纽约", and "纽约" is assigned to the place names of the basic dictionary.

In summary, the detailed process of assigning a target word to the corresponding target category in the basic dictionary according to the probability values of the target word belonging to each named entity category can also be as shown in FIG. 5 and include the following steps:

Step S501: determine the probability values of the target word belonging to each named entity category.

Step S502: arrange the probability values in descending order and determine the maximum probability value.

Step S503: determine whether the maximum probability value is greater than the first set threshold; if not, perform step S504; if yes, perform step S506.

Step S504: select at least one named entity category whose probability value is greater than the third set threshold as a candidate category.

Step S505: according to the similarity between the target word and the words contained in each of the at least one candidate category, select the candidate category with the highest similarity as the target category, and assign the target word to the target category.

Step S506: take the named entity category corresponding to the maximum probability value as the target category, and assign the target word to the target category.
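The decision flow of steps S501 through S506 can be sketched as a single function. The threshold values and the similarity callback are placeholders for illustration; the callback stands in for the WordNet-based comparison of step S505:

```python
def assign_category(probs, similarity_fn, first_threshold=0.8, third_threshold=0.5):
    """Decision flow of FIG. 5 (steps S501-S506).

    probs:         dict mapping named entity category -> probability value.
    similarity_fn: callback giving the similarity between the target word
                   and the existing words of a given category.
    Returns the target category, or None when no category qualifies.
    """
    # S502: sort probabilities in descending order, take the maximum.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best_cat, best_prob = ranked[0]

    # S503/S506: the maximum probability clears the first threshold.
    if best_prob > first_threshold:
        return best_cat

    # S504: otherwise keep every category above the third threshold.
    candidates = [cat for cat, p in ranked if p > third_threshold]
    if not candidates:
        return None

    # S505: among the candidates, pick the most similar category.
    return max(candidates, key=similarity_fn)
```

Replaying the two worked examples: "杭州" with place-name probability 0.98 is assigned directly via S506, while "纽约" falls through to the similarity comparison and lands in the place names via S505.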

本申请实施例提供的词典构造方法，可以基于包括百科文本数据、带有领域类别标注的领域文本数据，以及带有命名实体类别标注的命名实体识别文本样本的训练数据集，对类别预测模型进行训练，得到已训练的类别预测模型，并基于训练数据集构建包括多个命名实体类别的基础词典。在得到已训练的类别预测模型后，基于类别预测模型，识别出待处理文本中属于命名实体的目标词语，以及目标词语分别属于基础词典中的各个命名实体类别的概率值。在得到概率值后，基于最大概率值大于设定的阈值，将目标词语归属至最大概率值对应的命名实体类别中，完成词典的扩展，或者若最大概率值不大于设定的阈值，按照概率值从中选取出多个命名实体类别作为候选类别，并确定目标词语与每个候选类别包含的词语的相似度，基于相似度，将目标词语归属至相似度最高对应的命名实体类别中，完成词典的扩展。从而有效地规避了依赖领域专家设计的复杂的规则或者大量的人工标注工作，有效地利用了海量的自然语言文本和深度模型学习文本语义表征，最终高效高质量的实现了词典的构造和扩展。同时，也实现了判断新的语料信息能够更加灵活和高效，并且词典匹配的速度快，大大提升了词典构造的准确率，在一定程度上缓和了性能问题。The dictionary construction method provided by the embodiments of the present application can train a category prediction model based on a training data set including encyclopedia text data, domain text data with domain category annotations, and named entity recognition text samples with named entity category annotations, obtaining a trained category prediction model, and can build a basic dictionary including multiple named entity categories based on the training data set. After the trained category prediction model is obtained, the target words belonging to named entities in the text to be processed are identified based on the model, together with the probability values that the target words belong to each named entity category in the basic dictionary. After the probability values are obtained, if the maximum probability value is greater than the set threshold, the target word is assigned to the named entity category corresponding to the maximum probability value to complete the expansion of the dictionary; otherwise, multiple named entity categories are selected as candidate categories according to the probability values, the similarity between the target word and the words contained in each candidate category is determined, and the target word is assigned, based on the similarity, to the named entity category with the highest similarity to complete the expansion of the dictionary. This effectively avoids relying on complex rules designed by domain experts or a large amount of manual annotation work, effectively utilizes massive natural language texts and deep models to learn text semantic representations, and finally realizes the construction and expansion of dictionaries with high efficiency and high quality. At the same time, judging new corpus information becomes more flexible and efficient, dictionary matching is fast, the accuracy of dictionary construction is greatly improved, and performance problems are alleviated to a certain extent.

参阅图6所示,下面采用一个具体的应用场景,对以上实施例做出进一步详细说明:Referring to Fig. 6, a specific application scenario is adopted below to further describe the above embodiment in detail:

假设获取到的待处理文本为“明明去参观黄山”，基础词典中包含有人名、组织机构名、地名、节日、产品名和时间共6个命名实体类别。Assuming that the acquired text to be processed is "Mingming goes to visit Huangshan", the basic dictionary contains 6 named entity categories: person's name, organization name, place name, festival, product name and time.

将待处理文本“明明去参观黄山”处理成“[CLS]明明去参观黄山[SEP]”后，输入到类别预测模型的预训练语言子模型中，得到对应的嵌入向量“E[CLS] E1、E2、E3、E4、E5、E6、E7、E[SEP]”。After the text to be processed, "Mingming goes to visit Huangshan", is processed into "[CLS] Mingming goes to visit Huangshan [SEP]", it is input into the pre-trained language sub-model of the category prediction model to obtain the corresponding embedding vectors "E[CLS], E1, E2, E3, E4, E5, E6, E7, E[SEP]".
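文本加上[CLS]/[SEP]标记的预处理可以示意如下（按字符切分为示意性假设）：The preprocessing that wraps the text with [CLS]/[SEP] markers can be sketched as follows (character-level splitting is an illustrative assumption):

```python
def format_model_input(text):
    # BERT-style input sequence: [CLS] + characters + [SEP]
    return ["[CLS]"] + list(text) + ["[SEP]"]
```

对“明明去参观黄山”（7个字符），该函数产生 9 个输入位置，对应上文的 E[CLS]、E1～E7 和 E[SEP]。For "明明去参观黄山" (7 characters), this yields 9 input positions, matching E[CLS], E1 to E7 and E[SEP] above.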

将嵌入向量“E[CLS]E1、E2、E3、E4、E5、E6、E7、E[SEP]”输入到类别预测模型的命名实体识别子模型中，得到“E1+E2”和“E6+E7”为属于命名实体的目标词语。同时，可以得到“E1+E2”属于人名的概率值为0.98，属于组织机构名的概率值为0.31，属于地名的概率值为0.12，属于节日的概率值为0.06，属于产品名的概率值为0.45，属于时间的概率值为0.03；“E6+E7”属于人名的概率值为0.68，属于组织机构名的概率值为0.78，属于地名的概率值为0.89，属于节日的概率值为0.07，属于产品名的概率值为0.72，属于时间的概率值为0.04。Input the embedding vectors "E[CLS], E1, E2, E3, E4, E5, E6, E7, E[SEP]" into the named entity recognition sub-model of the category prediction model to obtain "E1+E2" and "E6+E7" as the target words belonging to named entities. At the same time, it can be obtained that the probability values of "E1+E2" belonging to a person's name, an organization name, a place name, a festival, a product name and a time are 0.98, 0.31, 0.12, 0.06, 0.45 and 0.03 respectively; and the probability values of "E6+E7" belonging to a person's name, an organization name, a place name, a festival, a product name and a time are 0.68, 0.78, 0.89, 0.07, 0.72 and 0.04 respectively.

假设第一设定阈值为0.95，第二设定阈值为0.9，第三设定阈值为0.6，则可以确定“E1+E2”对应的目标词语“明明”所属的命名实体类别为人名，并将目标词语“明明”加入至基础词典的人名中。Assuming that the first set threshold is 0.95, the second set threshold is 0.9, and the third set threshold is 0.6, it can be determined that the named entity category to which the target word "Mingming" corresponding to "E1+E2" belongs is person's name, and the target word "Mingming" is added to the person names of the basic dictionary.

由于“E6+E7”对应的目标词语“黄山”分别属于人名、组织机构名、地名、节日、产品名和时间的概率值均小于第一设定阈值0.95和第二设定阈值0.9，且大于第三设定阈值0.6的命名实体类别有人名、组织机构名、地名和产品名，则可以确定目标词语“黄山”分别与人名、组织机构名、地名和产品名各自包含的词语之间的相似度，并得到目标词语“黄山”与人名包含的词语的相似度为0.06，与组织机构名包含的词语的相似度为0.14，与地名包含的词语的相似度为0.76，与产品名包含的词语的相似度为0.04。可以确定目标词语“黄山”所属的命名实体类别为地名，并将目标词语“黄山”加入至基础词典的地名中。Since the probability values of the target word "Huangshan" corresponding to "E6+E7" belonging to a person's name, an organization name, a place name, a festival, a product name and a time are all less than the first set threshold of 0.95 and the second set threshold of 0.9, and the named entity categories whose probability values are greater than the third set threshold of 0.6 are person's name, organization name, place name and product name, the similarity between the target word "Huangshan" and the words contained in each of these four categories can be determined. The similarity between "Huangshan" and the words contained in the person-name category is 0.06, that with the organization-name category is 0.14, that with the place-name category is 0.76, and that with the product-name category is 0.04. It can therefore be determined that the named entity category to which the target word "Huangshan" belongs is place name, and the target word "Huangshan" is added to the place names of the basic dictionary.
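上述实例可以用如下可运行的示意代码复现（类别名以英文标签代替，阈值与数值取自上文，classify 函数为示意性假设）：The worked example above can be reproduced with the following runnable sketch (category names are replaced with English labels; the thresholds and values are copied from the text, and the classify function is an illustrative assumption):

```python
def classify(probs, sims, t1=0.95, t3=0.6):
    # assign directly if the maximum probability clears the first threshold,
    # otherwise fall back to similarity over candidate categories
    best = max(probs, key=probs.get)
    if probs[best] > t1:
        return best
    candidates = [c for c, p in probs.items() if p > t3]
    return max(candidates, key=lambda c: sims[c])

# probabilities from the text for "Mingming" (E1+E2) and "Huangshan" (E6+E7)
mingming = {"person": 0.98, "org": 0.31, "place": 0.12,
            "festival": 0.06, "product": 0.45, "time": 0.03}
huangshan = {"person": 0.68, "org": 0.78, "place": 0.89,
             "festival": 0.07, "product": 0.72, "time": 0.04}
# similarities from the text for "Huangshan" against the candidate categories
huangshan_sims = {"person": 0.06, "org": 0.14, "place": 0.76, "product": 0.04}
```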

在一些实施例中，可以将本申请提出的词典构造方法与相关技术中的基于N-Gram的词典构建方法、基于word2Vec的情感词典构建方法、基于textrank的词典构建方法，以及面向产品评价对象挖掘的领域词典构建方法进行比较，比较结果可以如下表1所示：In some embodiments, the dictionary construction method proposed in this application can be compared with the N-Gram-based dictionary construction method, the word2Vec-based sentiment dictionary construction method, the textrank-based dictionary construction method, and the domain dictionary construction method oriented to product evaluation object mining in the related art; the comparison results are shown in Table 1 below:

表1Table 1

模型 Model | Precision | Recall | F1-score
基于N-Gram Based on N-Gram | 40.0 | |
基于word2Vec Based on word2Vec | 60.3 | 37.5 | 46.2
基于textrank Based on textrank | 59.8 | 58.6 | 59.2
面向产品评价对象挖掘 Product Evaluation Object Mining | 64.5 | 69.2 | 66.8
Our Method | 85.6 | 86.2 | 85.9

其中,Precision表示准确率,Recall表示召回率,F1-score是一个综合评价指标,且Precision、Recall和F1-score分别越高,表示方法的性能分别越好。Among them, Precision represents the accuracy rate, Recall represents the recall rate, and F1-score is a comprehensive evaluation index. The higher the Precision, Recall, and F1-score, respectively, the better the performance of the method.

从上表中可以看出，与基于N-Gram的词典构建方法、基于word2Vec的情感词典构建方法、基于textrank的词典构建方法，以及面向产品评价对象挖掘的领域词典构建方法相比，本申请提出的词典构造方法的Precision、Recall和F1-score都是最高的，表明本申请提出的词典构造方法的性能最好。As can be seen from the table above, compared with the N-Gram-based dictionary construction method, the word2Vec-based sentiment dictionary construction method, the textrank-based dictionary construction method, and the domain dictionary construction method oriented to product evaluation object mining, the dictionary construction method proposed in this application achieves the highest Precision, Recall and F1-score, indicating that it has the best performance.
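表1中的F1-score是Precision与Recall的调和平均数，可以用如下示意代码验证（计数示例为假设数据）：The F1-score in Table 1 is the harmonic mean of Precision and Recall, which can be verified with the following sketch (the count example uses hypothetical data):

```python
def precision_recall_f1(tp, fp, fn):
    # compute the three metrics from true positive, false positive
    # and false negative counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def f1_from_pr(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```

例如，word2Vec 行满足 2×60.3×37.5/(60.3+37.5) ≈ 46.2，与表1一致。For example, the word2Vec row satisfies 2 × 60.3 × 37.5 / (60.3 + 37.5) ≈ 46.2, consistent with Table 1.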

与图4a所示的词典构造方法基于同一发明构思，本申请实施例中还提供了一种词典构造装置，该词典构造装置可以布设在服务器或终端设备中。由于该装置是本申请词典构造方法对应的装置，并且该装置解决问题的原理与该方法相似，因此该装置的实施可以参见上述方法的实施，重复之处不再赘述。Based on the same inventive concept as the dictionary construction method shown in FIG. 4a, an embodiment of the present application also provides a dictionary construction apparatus, which may be deployed in a server or a terminal device. Since the apparatus corresponds to the dictionary construction method of the present application and the principle by which the apparatus solves the problem is similar to that of the method, the implementation of the apparatus can refer to the implementation of the above method, and repeated content will not be described again.

图7示出了本申请实施例提供的一种词典构造装置的结构示意图，如图7所示，该词典构造装置包括获取模块701、词语识别模块702、类别识别模块703和词典构造模块704。FIG. 7 shows a schematic structural diagram of a dictionary construction apparatus provided by an embodiment of the present application. As shown in FIG. 7, the dictionary construction apparatus includes an acquisition module 701, a word recognition module 702, a category recognition module 703 and a dictionary construction module 704.

其中,获取模块701,用于获取待处理文本和基础词典;其中,基础词典包含多个词语类别;Wherein, the obtaining module 701 is used for obtaining the text to be processed and a basic dictionary; wherein, the basic dictionary includes a plurality of word categories;

词语识别模块702,用于基于已训练的类别预测模型,确定出待处理文本包含的至少一个候选词语,以及至少一个候选词语各自的语义信息;A word recognition module 702, configured to determine, based on the trained category prediction model, at least one candidate word contained in the text to be processed, and the respective semantic information of the at least one candidate word;

类别识别模块703，用于通过类别预测模型，根据至少一个候选词语各自的语义信息，选取出符合设定语义条件的至少一个目标词语，并确定至少一个目标词语各自分别属于多个词语类别的概率值；The category identification module 703 is configured to select, through the category prediction model and according to the respective semantic information of the at least one candidate word, at least one target word that meets the set semantic condition, and to determine the probability values that each of the at least one target word belongs to the multiple word categories;

词典构造模块704,用于将至少一个目标词语,分别归属至对应的概率值符合设定概率条件的词语类别中。The dictionary construction module 704 is configured to assign at least one target word to word categories whose corresponding probability values meet the set probability conditions, respectively.

可选的,类别预测模型包括预训练语言子模型和命名实体识别子模型;词语识别模块702,具体用于:Optionally, the category prediction model includes a pre-trained language sub-model and a named entity recognition sub-model; the word recognition module 702 is specifically used for:

基于待处理文本，通过预训练语言子模型，得到待处理文本包含的至少一个单词各自对应的词向量；其中，每个词向量表征相应单词的语义信息；Based on the text to be processed, word vectors corresponding to the at least one word contained in the text to be processed are obtained through the pre-trained language sub-model; wherein each word vector represents the semantic information of the corresponding word;

基于至少一个单词各自对应的词向量,通过命名实体识别子模型,对至少一个单词进行组合,得到至少一个候选词语,以及至少一个候选词语各自的语义信息;其中,每个候选词语包含至少一个单词。Based on the word vector corresponding to at least one word, at least one word is combined through a named entity recognition sub-model to obtain at least one candidate word and the respective semantic information of at least one candidate word; wherein, each candidate word contains at least one word .
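单词组合为候选词语的过程可以用BIO标注方案示意如下（BIO方案为示意性假设，申请文件并未限定具体的组合方式）：The process of combining single words into candidate words can be sketched with a BIO tagging scheme as follows (the BIO scheme is an illustrative assumption; the application does not specify the concrete combination method):

```python
def combine_bio(chars, tags):
    # combine per-character predictions into candidate words:
    # "B" begins an entity word, "I" continues it, "O" is outside any entity
    words, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "B":
            if current:
                words.append("".join(current))
            current = [ch]
        elif tag == "I" and current:
            current.append(ch)
        else:
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

例如对“明明去参观黄山”预测标签 B I O O O B I，即得到候选词语“明明”和“黄山”。For example, predicting tags B I O O O B I for "明明去参观黄山" yields the candidate words "明明" and "黄山".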

可选的,类别识别模块703,具体用于:Optionally, the category identification module 703 is specifically used for:

通过命名实体识别子模型,从至少一个候选词语中,选取出语义信息属于命名实体的至少一个目标词语;命名实体为具有特定语义的实体名称;Through the named entity recognition sub-model, from at least one candidate word, at least one target word whose semantic information belongs to the named entity is selected; the named entity is an entity name with specific semantics;

针对至少一个目标词语,分别执行以下操作:通过命名实体识别子模型,根据一个目标词语的语义信息,确定一个目标词语分别属于多个词语类别的概率值。For at least one target word, the following operations are respectively performed: through the named entity recognition sub-model, according to the semantic information of a target word, determine a probability value that a target word belongs to multiple word categories respectively.

可选的,如图8所示,上述装置还可以包括模型训练模块801,用于:Optionally, as shown in FIG. 8 , the above-mentioned apparatus may further include a model training module 801 for:

获取训练数据集;训练数据集中包括多个文本数据样本,文本数据样本中标注有设定类别;Obtain a training data set; the training data set includes multiple text data samples, and the text data samples are marked with set categories;

基于训练数据集,对类别预测模型进行迭代训练,直到满足设定的收敛条件为止,其中,一次迭代训练过程包括:Based on the training data set, the category prediction model is iteratively trained until the set convergence conditions are met, wherein an iterative training process includes:

基于从训练数据集中抽取的文本数据样本,通过类别预测模型,确定文本数据样本中的至少一个目标词语,并确定至少一个目标词语各自对应的目标词语类别;Based on the text data samples extracted from the training data set, through the category prediction model, at least one target word in the text data sample is determined, and the target word category corresponding to each of the at least one target word is determined;

根据目标词语类别与设定类别,确定相应的损失值,并根据损失值,对类别预测模型进行参数调整。According to the target word category and the set category, determine the corresponding loss value, and adjust the parameters of the category prediction model according to the loss value.

可选的,类别预测模型包括预训练语言子模型和命名实体识别子模型;训练数据集包括百科文本数据样本、领域文本数据样本和命名实体识别文本数据样本;模型训练模块801还用于:Optionally, the category prediction model includes a pre-trained language sub-model and a named entity recognition sub-model; the training data set includes encyclopedia text data samples, domain text data samples and named entity recognition text data samples; the model training module 801 is also used for:

基于从百科文本数据样本和领域文本数据样本中，抽取的文本数据样本，通过预训练语言子模型，确定相应的嵌入向量样本；百科文本数据样本为无针对性领域的文本数据，每个领域文本数据样本为包括多个设定类别命名实体的文本数据，且标注有对应的领域类别；Based on the text data samples extracted from the encyclopedia text data samples and the domain text data samples, the corresponding embedding vector samples are determined through the pre-trained language sub-model; the encyclopedia text data samples are text data of no specific domain, and each domain text data sample is text data including multiple named entities of set categories and is annotated with the corresponding domain category;

基于从嵌入向量样本和词语向量样本中抽取的向量样本，通过命名实体识别子模型，确定相应的至少一个目标词语及其各自对应的目标词语类别；词语向量样本是基于命名实体识别文本数据样本得到的；命名实体识别文本数据样本为包括至少一个命名实体的文本数据，且每个命名实体词语标注有对应的命名实体类别。Based on the vector samples extracted from the embedding vector samples and the word vector samples, the corresponding at least one target word and its respective corresponding target word category are determined through the named entity recognition sub-model; the word vector samples are obtained based on the named entity recognition text data samples; the named entity recognition text data samples are text data including at least one named entity, and each named entity word is annotated with a corresponding named entity category.

可选的,模型训练模块801还用于:Optionally, the model training module 801 is also used for:

根据目标词语类别与领域类别,确定第一损失值,并根据第一损失值对预训练语言子模型进行参数调整;Determine the first loss value according to the target word category and domain category, and adjust the parameters of the pre-trained language sub-model according to the first loss value;

根据目标词语类别与领域类别、命名实体类别,确定第二损失值,并根据第二损失值对命名实体识别子模型进行参数调整。The second loss value is determined according to the target word category, the field category, and the named entity category, and the named entity recognition sub-model is parameter adjusted according to the second loss value.
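上述双损失参数调整可以用如下玩具代码示意（交叉熵与标量“梯度”均为极度简化的假设，真实训练应使用自动微分框架）：The two-loss parameter adjustment described above can be sketched with the following toy code (the cross-entropy and the scalar "gradient" are drastically simplified assumptions; real training would use an autograd framework):

```python
import math

def cross_entropy(probs, true_idx):
    # negative log-likelihood of the correct class
    return -math.log(probs[true_idx])

def training_step(lm_param, ner_param, domain_probs, domain_label,
                  entity_probs, entity_label, lr=0.1):
    loss1 = cross_entropy(domain_probs, domain_label)    # first loss: domain category
    loss2 = cross_entropy(entity_probs, entity_label)    # second loss: named entity category
    lm_param = lm_param - lr * loss1    # adjust the pre-trained language sub-model
    ner_param = ner_param - lr * loss2  # adjust the NER sub-model
    return lm_param, ner_param, loss1, loss2
```

要点在于两个子模型各自由对应的损失值驱动更新。The point is that each sub-model is updated from its own corresponding loss value.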

可选的,模型训练模块801还用于:Optionally, the model training module 801 is also used for:

对训练数据集中的百科文本数据样本、领域文本数据样本和命名实体识别文本数据样本,分别进行预处理操作;预处理操作包括数据筛选和格式转换中的至少一种。Preprocessing operations are respectively performed on the encyclopedia text data samples, the domain text data samples and the named entity recognition text data samples in the training data set; the preprocessing operations include at least one of data screening and format conversion.

可选的,词典构造模块704,具体用于:Optionally, the dictionary construction module 704 is specifically used for:

基于一个目标词语分别属于多个词语类别的概率值,确定出最大概率值对应的词语类别,并将词语类别作为目标类别;Based on the probability values of a target word belonging to multiple word categories, determine the word category corresponding to the maximum probability value, and use the word category as the target category;

若一个目标词语属于目标类别的概率值大于第一设定阈值,则将一个目标词语归属至目标类别中。If the probability value of a target word belonging to the target category is greater than the first set threshold, then a target word is assigned to the target category.

可选的,词典构造模块704,还用于:Optionally, the dictionary construction module 704 is further used for:

若一个目标词语分别属于多个词语类别的概率值不大于第二设定阈值，则选取出概率值大于第三设定阈值的至少一个词语类别，作为候选类别；第三设定阈值小于第二设定阈值；If the probability values of a target word belonging to the multiple word categories are all not greater than the second set threshold, at least one word category whose probability value is greater than the third set threshold is selected as a candidate category; the third set threshold is less than the second set threshold;

基于一个目标词语分别与至少一个候选类别各自包含的词语的相似度,选取出符合设定相似度条件的候选类别,作为目标类别,并将一个目标词语归属至目标类别中。Based on the similarity between a target word and words contained in at least one candidate category, a candidate category that meets the set similarity condition is selected as a target category, and a target word is assigned to the target category.

在介绍了本申请示例性实施方式的词典构造方法和装置之后,接下来,介绍根据本申请的另一示例性实施方式的电子设备。After introducing the dictionary construction method and apparatus of the exemplary embodiment of the present application, next, an electronic device according to another exemplary embodiment of the present application is introduced.

所属技术领域的技术人员能够理解，本申请的各个方面可以实现为系统、方法或程序产品。因此，本申请的各个方面可以具体实现为以下形式，即：完全的硬件实施方式、完全的软件实施方式（包括固件、微代码等），或硬件和软件方面结合的实施方式，这里可以统称为“电路”、“模块”或“系统”。As will be appreciated by those skilled in the art, various aspects of the present application may be implemented as a system, a method or a program product. Therefore, various aspects of the present application can be embodied in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module" or "system".

与上述方法实施例基于同一发明构思，本申请实施例中还提供了一种电子设备，参阅图9所示，其为应用本申请实施例的一种电子设备的一个硬件组成结构示意图，电子设备900可以至少包括处理器901、以及存储器902。其中，存储器902存储有程序代码，当程序代码被处理器901执行时，使得处理器901执行上述任意一种词典构造方法的步骤。Based on the same inventive concept as the above method embodiments, an electronic device is also provided in the embodiments of the present application. Referring to FIG. 9, which is a schematic diagram of a hardware structure of an electronic device to which the embodiments of the present application are applied, the electronic device 900 may include at least a processor 901 and a memory 902. The memory 902 stores program code, and when the program code is executed by the processor 901, the processor 901 is caused to execute the steps of any one of the above dictionary construction methods.

在一些可能的实施方式中,根据本申请的计算装置可以至少包括至少一个处理器、以及至少一个存储器。其中,存储器存储有程序代码,当程序代码被处理器执行时,使得处理器执行本说明书上述描述的根据本申请各种示例性实施方式的词典构造方法的步骤。例如,处理器可以执行如图4a中所示的步骤。In some possible implementations, a computing device according to the present application may include at least one processor, and at least one memory. The memory stores program codes, which, when executed by the processor, cause the processor to execute the steps of the dictionary construction method according to various exemplary embodiments of the present application described above in this specification. For example, the processor may perform the steps shown in Figure 4a.

下面参照图10来描述根据本申请的这种实施方式的计算装置1000。如图 10所示,计算装置1000以通用计算装置的形式表现。计算装置1000的组件可以包括但不限于:上述至少一个处理单元1001、上述至少一个存储单元1002、连接不同系统组件(包括存储单元1002和处理单元1001)的总线1003。A computing device 1000 according to such an embodiment of the present application is described below with reference to FIG. 10 . As shown in Figure 10, computing device 1000 takes the form of a general-purpose computing device. Components of the computing device 1000 may include, but are not limited to, the above-mentioned at least one processing unit 1001 , the above-mentioned at least one storage unit 1002 , and a bus 1003 connecting different system components (including the storage unit 1002 and the processing unit 1001 ).

总线1003表示几类总线结构中的一种或多种,包括存储器总线或者存储器控制器、外围总线、处理器或者使用多种总线结构中的任意总线结构的局域总线。Bus 1003 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus structures.

存储单元1002可以包括易失性存储器形式的可读介质,例如随机存取存储器(RAM)10021和/或高速缓存存储单元10022,还可以进一步包括只读存储器(ROM)10023。Storage unit 1002 may include readable media in the form of volatile memory, such as random access memory (RAM) 10021 and/or cache storage unit 10022 , and may further include read only memory (ROM) 10023 .

存储单元1002还可以包括具有一组(至少一个)程序模块10024的程序/ 实用工具10025,这样的程序模块10024包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。The storage unit 1002 may also include a program/utility 10025 having a set (at least one) of program modules 10024 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, An implementation of a network environment may be included in each or some combination of these examples.

计算装置1000也可以与一个或多个外部设备1004（例如键盘、指向设备等）通信，还可与一个或者多个使得用户能与计算装置1000交互的设备通信，和/或与使得该计算装置1000能与一个或多个其它计算装置进行通信的任何设备（例如路由器、调制解调器等等）通信。这种通信可以通过输入/输出（I/O）接口1005进行。并且，计算装置1000还可以通过网络适配器1006与一个或者多个网络（例如局域网（LAN），广域网（WAN）和/或公共网络，例如因特网）通信。如图所示，网络适配器1006通过总线1003与用于计算装置1000的其它模块通信。应当理解，尽管图中未示出，可以结合计算装置1000使用其它硬件和/或软件模块，包括但不限于：微代码、设备驱动器、冗余处理器、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。The computing device 1000 may also communicate with one or more external devices 1004 (e.g., keyboards, pointing devices, etc.), with one or more devices that enable a user to interact with the computing device 1000, and/or with any device (e.g., router, modem, etc.) that enables the computing device 1000 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 1005. Moreover, the computing device 1000 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 1006. As shown, the network adapter 1006 communicates with the other modules of the computing device 1000 via the bus 1003. It should be understood that, although not shown, other hardware and/or software modules may be used in conjunction with the computing device 1000, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

与上述方法实施例基于同一发明构思，本申请提供的词典构造方法的各个方面还可以实现为一种程序产品的形式，其包括程序代码，当程序产品在电子设备上运行时，程序代码用于使电子设备执行本说明书上述描述的根据本申请各种示例性实施方式的词典构造方法中的步骤，例如，电子设备可以执行如图4a中所示的步骤。Based on the same inventive concept as the above method embodiments, various aspects of the dictionary construction method provided by the present application can also be implemented in the form of a program product, which includes program code; when the program product runs on an electronic device, the program code is used to cause the electronic device to execute the steps of the dictionary construction method according to the various exemplary embodiments of the present application described above in this specification; for example, the electronic device may execute the steps shown in FIG. 4a.

程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以是但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

尽管已描述了本申请的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括优选实施例以及落入本申请范围的所有变更和修改。While the preferred embodiments of the present application have been described, additional changes and modifications to these embodiments may occur to those skilled in the art once the basic inventive concepts are known. Therefore, the appended claims are intended to be construed to include the preferred embodiment and all changes and modifications that fall within the scope of this application.

显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的精神和范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。Obviously, those skilled in the art can make various changes and modifications to the present application without departing from the spirit and scope of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include these modifications and variations.

Claims (15)

1.一种词典构造方法,其特征在于,包括:1. a dictionary construction method, is characterized in that, comprises: 获取待处理文本和基础词典;其中,所述基础词典包含多个词语类别;Obtain the text to be processed and a basic dictionary; wherein, the basic dictionary contains multiple word categories; 基于已训练的类别预测模型,确定出所述待处理文本包含的至少一个候选词语,以及所述至少一个候选词语各自的语义信息;Based on the trained category prediction model, determine at least one candidate word included in the text to be processed, and the respective semantic information of the at least one candidate word; 通过所述类别预测模型,根据所述至少一个候选词语各自的语义信息,选取出符合设定语义条件的至少一个目标词语,并确定所述至少一个目标词语各自分别属于所述多个词语类别的概率值;Through the category prediction model, according to the respective semantic information of the at least one candidate word, at least one target word that meets the set semantic conditions is selected, and it is determined that the at least one target word belongs to each of the multiple word categories. probability value; 将所述至少一个目标词语,分别归属至对应的概率值符合设定概率条件的词语类别中。The at least one target word is respectively assigned to word categories whose corresponding probability values meet the set probability conditions. 2.如权利要求1所述的方法,其特征在于,所述类别预测模型包括预训练语言子模型和命名实体识别子模型;所述基于已训练的类别预测模型,确定出所述待处理文本包含的至少一个候选词语,以及所述至少一个候选词语各自的语义信息,包括:2. 
The method of claim 1, wherein the category prediction model comprises a pre-trained language sub-model and a named entity recognition sub-model; and the category prediction model based on the training determines the to-be-processed text At least one candidate word included, and the respective semantic information of the at least one candidate word, including: 基于所述待处理文本,通过所述预训练语言子模型,得到所述待处理文本包含的至少一个单词各自对应的词向量;其中,每个词向量表征相应单词的语义信息;Based on the text to be processed, through the pre-trained language sub-model, a word vector corresponding to at least one word contained in the text to be processed is obtained; wherein, each word vector represents the semantic information of the corresponding word; 基于所述至少一个单词各自对应的词向量,通过所述命名实体识别子模型,对所述至少一个单词进行组合,得到至少一个候选词语,以及所述至少一个候选词语各自的语义信息;其中,每个候选词语包含至少一个单词。Based on the respective word vectors corresponding to the at least one word, the at least one word is combined by the named entity recognition sub-model to obtain at least one candidate word and the respective semantic information of the at least one candidate word; wherein, Each candidate word contains at least one word. 3.如权利要求2所述的方法,其特征在于,所述通过所述类别预测模型,根据所述至少一个候选词语各自的语义信息,选取出符合设定语义条件的至少一个目标词语,并确定所述至少一个目标词语各自分别属于所述多个词语类别的概率值,包括:3. 
The method according to claim 2, wherein, according to the respective semantic information of the at least one candidate word through the category prediction model, at least one target word that meets the set semantic conditions is selected, and Determining the probability values that the at least one target word respectively belongs to the plurality of word categories includes: 通过所述命名实体识别子模型,从所述至少一个候选词语中,选取出语义信息属于命名实体的至少一个目标词语;所述命名实体为具有特定语义的实体名称;Through the named entity recognition sub-model, from the at least one candidate word, at least one target word whose semantic information belongs to a named entity is selected; the named entity is an entity name with specific semantics; 针对所述至少一个目标词语,分别执行以下操作:通过所述命名实体识别子模型,根据所述一个目标词语的语义信息,确定所述一个目标词语分别属于所述多个词语类别的概率值。For the at least one target word, the following operations are respectively performed: using the named entity recognition sub-model, according to the semantic information of the one target word, determine a probability value that the one target word belongs to the multiple word categories respectively. 4.如权利要求1~3中任一所述的方法,其特征在于,所述类别预测模型的训练过程包括:4. 
The method according to any one of claims 1 to 3, wherein the training process of the category prediction model comprises: 获取训练数据集;所述训练数据集中包括多个文本数据样本,所述文本数据样本中标注有设定类别;Obtaining a training data set; the training data set includes a plurality of text data samples, and the text data samples are marked with set categories; 基于所述训练数据集,对所述类别预测模型进行迭代训练,直到满足设定的收敛条件为止,其中,一次迭代训练过程包括:Based on the training data set, the category prediction model is iteratively trained until a set convergence condition is met, wherein an iterative training process includes: 基于从所述训练数据集中抽取的文本数据样本,通过所述类别预测模型,确定所述文本数据样本中的至少一个目标词语,并确定所述至少一个目标词语各自对应的目标词语类别;Based on the text data samples extracted from the training data set, through the category prediction model, at least one target word in the text data sample is determined, and the target word category corresponding to each of the at least one target word is determined; 根据所述目标词语类别与所述设定类别,确定相应的损失值,并根据所述损失值,对所述类别预测模型进行参数调整。According to the target word category and the set category, a corresponding loss value is determined, and according to the loss value, the parameters of the category prediction model are adjusted. 5.如权利要求4所述的方法,其特征在于,所述类别预测模型包括预训练语言子模型和命名实体识别子模型;所述训练数据集包括百科文本数据样本、领域文本数据样本和命名实体识别文本数据样本;5. 
The method of claim 4, wherein the category prediction model comprises a pre-trained language sub-model and a named entity recognition sub-model; the training data set comprises encyclopedia text data samples, domain text data samples and naming Entity recognition text data samples; 所述基于从所述训练数据集中抽取的文本数据样本,通过所述类别预测模型,确定所述文本数据样本中的至少一个目标词语,并确定所述至少一个目标词语各自对应的目标词语类别,包括:determining at least one target word in the text data sample through the category prediction model based on the text data samples extracted from the training data set, and determining the target word category corresponding to each of the at least one target word, include: 基于从所述百科文本数据样本和领域文本数据样本中,抽取的文本数据样本,通过所述预训练语言子模型,确定相应的嵌入向量样本;所述百科文本数据样本为无针对性领域的文本数据,每个领域文本数据样本为包括多个设定类别命名实体的文本数据,且标注有对应的领域类别;Based on the text data samples extracted from the encyclopedia text data samples and the domain text data samples, the corresponding embedding vector samples are determined through the pre-trained language sub-model; the encyclopedia text data samples are texts in untargeted fields Data, each domain text data sample is text data that includes multiple named entities of set categories, and is marked with the corresponding domain category; 基于从所述嵌入向量样本和词语向量样本中抽取的向量样本,通过所述命名实体识别子模型,确定相应的至少一个目标词语及其各自对应的目标词语类别;所述词语向量样本是基于所述命名实体识别文本数据样本得到的;所述命名实体识别文本数据样本为包括至少一个命名实体的文本数据,且每个命名实体词语标注有对应的命名实体类别。Based on the vector samples extracted from the embedding vector samples and the word vector samples, the named entity recognition sub-model determines at least one corresponding target word and its corresponding target word category; the word vector samples are based on the The named entity identification text data sample is obtained by describing the named entity identification text data sample; the named entity identification text data sample is text data including at least one named entity, and each named entity word is marked with a corresponding named entity category. 
6. The method of claim 5, wherein the determining a corresponding loss value according to the target word category and the set category, and adjusting parameters of the category prediction model according to the loss value, comprises: determining a first loss value according to the target word category and the domain category, and adjusting parameters of the pre-trained language sub-model according to the first loss value; and determining a second loss value according to the target word category, the domain category and the named entity category, and adjusting parameters of the named entity recognition sub-model according to the second loss value. 7. The method of claim 5, wherein after the obtaining a training data set, the method further comprises: performing preprocessing operations on the encyclopedia text data samples, the domain text data samples and the named entity recognition text data samples in the training data set, respectively, wherein the preprocessing operations comprise at least one of data filtering and format conversion. 8.
The method according to any one of claims 1 to 3, wherein the assigning the at least one target word respectively to word categories whose corresponding probability values meet a set probability condition comprises: performing the following operations for each of the at least one target word: determining, based on the probability values of one target word respectively belonging to the plurality of word categories, the word category corresponding to the maximum probability value, and taking that word category as a target category; and if the probability value of the one target word belonging to the target category is greater than a first set threshold, assigning the one target word to the target category. 9. The method according to any one of claims 1 to 3, wherein the assigning the at least one target word respectively to word categories whose corresponding probability values meet a set probability condition comprises: performing the following operations for each of the at least one target word: if no probability value of one target word respectively belonging to the plurality of word categories is greater than a second set threshold, selecting at least one word category whose probability value is greater than a third set threshold as a candidate category, wherein the third set threshold is less than the second set threshold; and based on the similarity between the one target word and the words contained in each of the at least one candidate category, selecting a candidate category that meets a set similarity condition as a target category, and assigning the one target word to the target
category. 10. A dictionary construction apparatus, comprising: an acquisition module, configured to acquire a text to be processed and a basic dictionary, wherein the basic dictionary contains a plurality of word categories; a word recognition module, configured to determine, based on a trained category prediction model, at least one candidate word contained in the text to be processed and semantic information of each of the at least one candidate word; a category recognition module, configured to select, through the category prediction model and according to the semantic information of each of the at least one candidate word, at least one target word that meets a set semantic condition, and to determine probability values of each of the at least one target word respectively belonging to the plurality of word categories; and a dictionary construction module, configured to assign the at least one target word respectively to word categories whose corresponding probability values meet a set probability condition. 11.
The apparatus according to claim 10, wherein the category prediction model comprises a pre-trained language sub-model and a named entity recognition sub-model, and the word recognition module is specifically configured to: obtain, based on the text to be processed and through the pre-trained language sub-model, a word vector corresponding to each of at least one word contained in the text to be processed, wherein each word vector represents semantic information of the corresponding word; and combine the at least one word through the named entity recognition sub-model, based on the word vectors respectively corresponding to the at least one word, to obtain at least one candidate word and semantic information of each of the at least one candidate word, wherein each candidate word contains at least one word. 12. The apparatus of claim 11, wherein the category recognition module is specifically configured to: select, through the named entity recognition sub-model, from the at least one candidate word, at least one target word whose semantic information belongs to a named entity, wherein the named entity is an entity name with specific semantics; and perform the following operations for each of the at least one target word: determining, through the named entity recognition sub-model and according to the semantic information of one target word, probability values of the one target word respectively belonging to the plurality of word categories. 13.
An electronic device, comprising a processor and a memory, wherein the memory stores program code which, when executed by the processor, causes the processor to execute the steps of the method of any one of claims 1 to 9. 14. A computer-readable storage medium, comprising program code which, when run on an electronic device, causes the electronic device to execute the steps of the method of any one of claims 1 to 9. 15. A computer program product, comprising a computer program/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 9.
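The two assignment strategies of claims 8 and 9 (a confident arg-max path, and a lower-threshold candidate set resolved by similarity to the words already in each category) can be sketched as follows. The function name `assign_category`, the `similarity` callable, and the chaining of the two claims into one function are illustrative assumptions; the patent presents them as alternative embodiments.

```python
def assign_category(probs, thresh1, thresh2, thresh3, similarity):
    """Assign one target word to a word category, combining the logic of
    claims 8 and 9 for illustration.

    probs: dict mapping category name -> probability the word belongs to it.
    similarity: callable(category) -> similarity between the word and the
    words already contained in that category (a stand-in measure).
    """
    # Claim 8: take the category with the maximum probability and accept it
    # only if that probability exceeds the first set threshold.
    best = max(probs, key=probs.get)
    if probs[best] > thresh1:
        return best

    # Claim 9: if no category's probability exceeds the second set threshold,
    # keep every category whose probability exceeds the third (lower set)
    # threshold as a candidate, then pick the most similar candidate.
    if all(p <= thresh2 for p in probs.values()):
        candidates = [c for c, p in probs.items() if p > thresh3]
        if candidates:
            return max(candidates, key=similarity)

    return None  # the word is left out of the dictionary
```

With a confident prediction the first branch fires; with a flat probability distribution the similarity fallback decides among the plausible categories.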
CN202111475744.5A 2021-12-06 2021-12-06 Dictionary construction method, device, electronic device and storage medium Active CN114398482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111475744.5A CN114398482B (en) 2021-12-06 2021-12-06 Dictionary construction method, device, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN114398482A true CN114398482A (en) 2022-04-26
CN114398482B CN114398482B (en) 2025-05-09

Family

ID=81226055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111475744.5A Active CN114398482B (en) 2021-12-06 2021-12-06 Dictionary construction method, device, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN114398482B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186524A (en) * 2011-12-30 2013-07-03 高德软件有限公司 Address name identification method and device
US20170060991A1 (en) * 2015-04-21 2017-03-02 Lexisnexis, A Division Of Reed Elsevier Inc. Systems and methods for generating concepts from a document corpus
CN107193803A (en) * 2017-05-26 2017-09-22 北京东方科诺科技发展有限公司 A kind of particular task text key word extracting method based on semanteme
CN107807920A (en) * 2017-11-17 2018-03-16 新华网股份有限公司 Construction method, device and the server of mood dictionary based on big data
CN108470172A (en) * 2017-02-23 2018-08-31 阿里巴巴集团控股有限公司 A kind of text information identification method and device
CN108563786A (en) * 2018-04-26 2018-09-21 腾讯科技(深圳)有限公司 Text classification and methods of exhibiting, device, computer equipment and storage medium
CN109145303A (en) * 2018-09-06 2019-01-04 腾讯科技(深圳)有限公司 Name entity recognition method, device, medium and equipment
WO2019200806A1 (en) * 2018-04-20 2019-10-24 平安科技(深圳)有限公司 Device for generating text classification model, method, and computer readable storage medium
CN112836046A (en) * 2021-01-13 2021-05-25 哈尔滨工程大学 A method for entity recognition of policies and regulations texts in the field of four insurances and one housing fund
US20210174024A1 (en) * 2018-12-07 2021-06-10 Tencent Technology (Shenzhen) Company Limited Method for training keyword extraction model, keyword extraction method, and computer device
CN113032523A (en) * 2021-03-22 2021-06-25 平安科技(深圳)有限公司 Extraction method and device of triple information, electronic equipment and storage medium
CN113221539A (en) * 2021-07-08 2021-08-06 华东交通大学 Method and system for identifying nested named entities integrated with syntactic information
CN113609857A (en) * 2021-07-22 2021-11-05 武汉工程大学 Legal named entity identification method and system based on cascade model and data enhancement

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Xiaoyong et al.: "Photoelectric Detection and Control of Space Optical Remote Sensors", 31 December 2020, Beijing: China Astronautic Publishing House, pages 219-222 *
Hao Miao: "Research on Microblog Sentiment Classification Combining an Extended Dictionary with Rules", China Masters' Theses Full-text Database, Information Science and Technology Series (monthly), no. 2021, 15 April 2021 (2021-04-15), pages 17-39 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115994536A (en) * 2023-03-24 2023-04-21 浪潮电子信息产业股份有限公司 Text information processing method, system, equipment and computer storage medium
CN115994536B (en) * 2023-03-24 2023-07-14 浪潮电子信息产业股份有限公司 A text information processing method, system, device and computer storage medium

Also Published As

Publication number Publication date
CN114398482B (en) 2025-05-09

Similar Documents

Publication Publication Date Title
US12288027B2 (en) Text sentence processing method and apparatus, computer device, and storage medium
US12271701B2 (en) Method and apparatus for training text classification model
CN111930942B (en) Text classification method, language model training method, device and equipment
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN110609891A (en) A Visual Dialogue Generation Method Based on Context-Aware Graph Neural Network
CN110427623A (en) Semi-structured document Knowledge Extraction Method, device, electronic equipment and storage medium
CN111951805A (en) A text data processing method and device
CN113553412A (en) Question and answer processing method and device, electronic equipment and storage medium
CN112528677A (en) Training method and device of semantic vector extraction model and electronic equipment
CN112699686B (en) Semantic understanding method, device, equipment and medium based on task type dialogue system
CN114818708B (en) Key information extraction method, model training method, related device and electronic equipment
CN112906368B (en) Industry text increment method, related device and computer program product
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN117079298A (en) Information extraction method, training method of information extraction system and information extraction system
CN114357151A (en) Processing method, device and equipment of text category identification model and storage medium
CN111222330A (en) Chinese event detection method and system
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN116089597A (en) Statement recommendation method, device, equipment and storage medium
CN111597816A (en) Self-attention named entity recognition method, device, equipment and storage medium
CN114398482B (en) Dictionary construction method, device, electronic device and storage medium
CN116089586B (en) Question generation method based on text and training method of question generation model
Wu et al. A Text Emotion Analysis Method Using the Dual‐Channel Convolution Neural Network in Social Networks
CN113792121B (en) Training method and device of reading and understanding model, reading and understanding method and device
CN116432648A (en) Named entity recognition method and recognition device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant