[go: up one dir, main page]

CN115964490A - Project label prediction method, system, electronic device and storage medium - Google Patents

Project label prediction method, system, electronic device and storage medium Download PDF

Info

Publication number
CN115964490A
CN115964490A CN202211706930.XA CN202211706930A CN115964490A CN 115964490 A CN115964490 A CN 115964490A CN 202211706930 A CN202211706930 A CN 202211706930A CN 115964490 A CN115964490 A CN 115964490A
Authority
CN
China
Prior art keywords
word
project
character
classified
prediction method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211706930.XA
Other languages
Chinese (zh)
Inventor
舒文华
徐绍珺
蔡伟
张克非
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Thinvent Digital Technology Co Ltd
Original Assignee
Thinvent Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thinvent Digital Technology Co Ltd filed Critical Thinvent Digital Technology Co Ltd
Priority to CN202211706930.XA priority Critical patent/CN115964490A/en
Publication of CN115964490A publication Critical patent/CN115964490A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Machine Translation (AREA)

Abstract

本发明提供了一种项目标签预测方法、系统、电子设备及存储介质,属于数据处理的技术领域。所述方法包括:获取表征待分类投资项目的关键文本所对应的字符序列;将字符序列通过映射方式转换成若干嵌入表示,并将若干嵌入表示叠加得到项目信息词序列;通过Bert语言模型处理项目信息词序列输出词向量矩阵;针对词向量矩阵进行局部特征提取,并将提取的局部特征归一化处理得到池化结果;采用全连接对拼接后的池化结果进行变换处理得到整合特征;通过softmax分类器针对整合特征进行学习得到分类标签。本申请使用自然语言处理将项目文本转化为词向量,再对词向量进行卷积神经网络分类处理完成智能标签分类,实现提升投资项目标签分类的准确度、全面性及效率性。

Figure 202211706930

The invention provides an item label prediction method, system, electronic equipment and storage medium, belonging to the technical field of data processing. The method includes: obtaining character sequences corresponding to key texts representing investment projects to be classified; converting the character sequences into several embedded representations through mapping, and superimposing the several embedded representations to obtain project information word sequences; processing the project through the Bert language model The information word sequence outputs the word vector matrix; perform local feature extraction on the word vector matrix, and normalize the extracted local features to obtain the pooling result; use the full connection to transform the spliced pooling result to obtain the integrated feature; The softmax classifier learns the classification labels for the integrated features. This application uses natural language processing to convert project text into word vectors, and then performs convolutional neural network classification processing on word vectors to complete intelligent label classification, so as to improve the accuracy, comprehensiveness and efficiency of investment project label classification.

Figure 202211706930

Description

项目标签预测方法、系统、电子设备及存储介质Item label prediction method, system, electronic device and storage medium

技术领域technical field

本发明属于文本处理的技术领域,具体地涉及一种项目标签预测方法、系统、电子设备及存储介质。The invention belongs to the technical field of text processing, and in particular relates to an item label prediction method, system, electronic equipment and storage medium.

背景技术Background technique

投资项目的文本分类是根据文本内容将项目文本划分为预先定义好的类别,准确而又快速的项目标签分类可节省大量人力物力,在信息检索和信息存储上发挥着重要作用。随着软件技术的发展和普及,投资项目的管理软件在投资项目行业都得到了深入应用。就目前而言的发改行业全口径投资项目数据涉及97个大行业和1380个小行业,现有技术采用人工标注的方式针对项目文本的标签分类使得处理过程中受限于人员限制,其中,人工标注的方法是由标签分类人员根据自身经验的判断为项目文本确定标签分类,由于标签分类人员之间的经验丰富性存在差异,导致在项目标签分类的准确性、全面性、效率性等方面上存在不足。The text classification of investment projects is to divide project texts into pre-defined categories according to the text content. Accurate and fast project label classification can save a lot of manpower and material resources, and plays an important role in information retrieval and information storage. With the development and popularization of software technology, investment project management software has been deeply applied in the investment project industry. As far as the current development and reform industry full-caliber investment project data involves 97 large industries and 1380 small industries, the existing technology uses manual labeling to classify project text labels, which makes the processing process limited by personnel restrictions. Among them, The method of manual labeling is to determine the label classification for the item text by the label classifier according to the judgment of his own experience. Due to the difference in the richness of experience among the label classifiers, the accuracy, comprehensiveness, and efficiency of the item label classification There are deficiencies.

因此,如何实现投资项目的智能化标签分类以提升投资项目标签分类的准确度、全面性及效率性,显得尤为重要。Therefore, how to realize the intelligent label classification of investment projects to improve the accuracy, comprehensiveness and efficiency of investment project label classification is particularly important.

发明内容Contents of the invention

为了解决上述技术问题,本发明提供了一种项目标签预测方法、系统、电子设备及存储介质,使用自然语言处理将项目文本信息转化为词向量,再对词向量进行深度学习的卷积神经网络分类处理,最终完成对投资项目的智能标签分类,实现提升投资项目标签分类的准确度、全面性及效率性。In order to solve the above technical problems, the present invention provides a project label prediction method, system, electronic equipment and storage medium, which uses natural language processing to convert project text information into word vectors, and then performs deep learning on the convolutional neural network of word vectors Classification processing, and finally complete the intelligent label classification of investment projects, so as to improve the accuracy, comprehensiveness and efficiency of investment project label classification.

第一方面,本发明提供一种项目标签预测方法,包括:In a first aspect, the present invention provides a project label prediction method, including:

获取表征待分类投资项目的关键文本所对应的字符序列;Obtaining the character sequence corresponding to the key text characterizing the investment project to be classified;

将所述字符序列通过映射方式转换成若干嵌入表示,并将若干嵌入表示叠加得到项目信息词序列;其中,所述嵌入表示包括字符嵌入、位置嵌入及句子类型嵌入;Converting the character sequence into several embedding representations by means of mapping, and superimposing the several embedding representations to obtain the item information word sequence; wherein, the embedding representations include character embedding, position embedding and sentence type embedding;

通过Bert语言模型处理所述项目信息词序列输出词向量矩阵;Process the item information word sequence output word vector matrix by Bert language model;

针对所述词向量矩阵进行局部特征提取,并将提取的局部特征归一化处理得到池化结果;Performing local feature extraction for the word vector matrix, and normalizing the extracted local features to obtain a pooling result;

采用全连接对拼接后的所述池化结果进行变换处理得到整合特征;Transforming the spliced pooling results by full connection to obtain integrated features;

通过softmax分类器针对所述整合特征进行学习得到所述待分类投资项目的分类标签。A softmax classifier is used to learn the integrated features to obtain the classification labels of the investment items to be classified.

较佳地,所述获取表征待分类投资项目的关键文本所对应的字符序列的步骤具体包括:Preferably, the step of obtaining the character sequence corresponding to the key text characterizing the investment project to be classified specifically includes:

将待分类投资项目的项目名称、主要建设内容及行业领域进行串接,得到所述待分类投资项目的关键文本;Concatenate the project name, main construction content and industry field of the investment project to be classified to obtain the key text of the investment project to be classified;

将所述关键文本中的停用词进行去除得到字符组;removing stop words in the key text to obtain character groups;

将所述字符组中的前n个词与标识符进行拼接,并将所述标识符置于首位,以形成所述关键文本对应的字符序列。The first n words in the character group are spliced with identifiers, and the identifiers are placed in the first place to form a character sequence corresponding to the key text.

较佳地,所述通过Bert语言模型处理所述项目信息词序列输出词向量矩阵的步骤具体包括:Preferably, the step of processing the item information word sequence through the Bert language model to output a word vector matrix specifically includes:

将所述项目信息词序列转化成unicode,并通过Unicode码位去除所述unicode中不合法字符及多余空格,得到信息词字符串;Converting the item information word sequence into unicode, and removing illegal characters and redundant spaces in the unicode by Unicode code points, to obtain the information word string;

通过空格将所述信息词字符串中的中文字符进行分隔,并进行循环strip()操作,得到初始分词结果;The Chinese characters in the information word string are separated by spaces, and the loop strip() operation is performed to obtain the initial word segmentation result;

针对初始分词结果进行深处理得到目标分词结果;Perform deep processing on the initial word segmentation results to obtain the target word segmentation results;

将所述目标分词结果中的英文按照预设拆分原则进行拆分,得到词向量矩阵。The English in the target word segmentation result is split according to the preset splitting principle to obtain a word vector matrix.

较佳地,所述预设拆分原则具体为:Preferably, the preset splitting principle is specifically:

将英文按照subword词表进行拆分,每个单词拆分后的subword尽可能地长,采用贪婪最长优先匹配算法,对于每个单词,指针i=0、j=len从后向前匹配,直至单词的前缀[i:j]是subword词表中的一个subword,则将其取出,进而设置i=j、j=len,循环上述流程。Split the English language according to the subword vocabulary, the subword after each word is split as long as possible, adopt the greedy longest-first matching algorithm, for each word, the pointer i=0, j=len are matched from the back to the front, Until the prefix [i:j] of the word is a subword in the subword vocabulary, it is taken out, and then i=j, j=len are set, and the above-mentioned process is circulated.

较佳地,所述针对所述词向量矩阵进行局部特征提取,并将提取的局部特征归一化处理得到池化结果的步骤具体包括:Preferably, the step of performing local feature extraction on the word vector matrix, and normalizing the extracted local features to obtain a pooling result specifically includes:

利用卷积神经网络模型针对所述词向量矩阵进行局部特征提取,得到特征提取结果;Using a convolutional neural network model to perform local feature extraction on the word vector matrix to obtain a feature extraction result;

通过最大值池化操作将所述特征提取结果进行特征归一化,以选取局部最优特征得到池化结果。The feature extraction result is subjected to feature normalization through a maximum pooling operation, so as to select a local optimal feature to obtain a pooling result.

较佳地,所述全连接应用dropout策略以使部分神经元的激活概率固定在p值上,其中,p值的取值范围为0~1。Preferably, the full connection applies a dropout strategy to fix the activation probability of some neurons at a p value, wherein the p value ranges from 0 to 1.

较佳地,所述通过softmax分类器针对所述整合特征进行学习得到所述待分类投资项目的分类标签的具体步骤包括:Preferably, the specific steps of learning the integrated features through the softmax classifier to obtain the classification labels of the investment items to be classified include:

采用多类交叉熵函数作为卷积神经网络模型的损失函数;The multi-class cross-entropy function is used as the loss function of the convolutional neural network model;

将所述整合特征通过卷积神经网络模型计算以输出对应的分类标签。The integrated feature is calculated by a convolutional neural network model to output a corresponding classification label.

第二方面,本发明提供了一种项目标签预测系统,包括:In a second aspect, the present invention provides an item label prediction system, comprising:

表征模块,用于获取表征待分类投资项目的关键文本所对应的字符序列;A characterizing module, configured to obtain character sequences corresponding to key texts characterizing investment projects to be classified;

映射模块,用于将所述字符序列通过映射方式转换成若干嵌入表示,并将若干嵌入表示叠加得到项目信息词序列;其中,所述嵌入表示包括字符嵌入、位置嵌入及句子类型嵌入;The mapping module is used to convert the character sequence into several embedded representations by mapping, and superimpose several embedded representations to obtain the item information word sequence; wherein, the embedded representations include character embedding, position embedding and sentence type embedding;

语义模块,用于通过Bert语言模型处理所述项目信息词序列输出词向量矩阵;Semantic module, for processing described item information word sequence output word vector matrix by Bert language model;

处理模块,用于针对所述词向量矩阵进行局部特征提取,并将提取的局部特征归一化处理得到池化结果;A processing module, configured to extract local features for the word vector matrix, and normalize the extracted local features to obtain a pooling result;

变换模块,用于采用全连接对拼接后的所述池化结果进行变换处理得到整合特征;A transform module, configured to transform the spliced pooling results using full connections to obtain integrated features;

分类模块,用于通过softmax分类器针对所述整合特征进行学习得到所述待分类投资项目的分类标签。The classification module is used to learn the integrated features through a softmax classifier to obtain the classification labels of the investment items to be classified.

较佳地,所述表征模块包括:Preferably, the characterization module includes:

串接单元,用于将待分类投资项目的项目名称、主要建设内容及行业领域进行串接,得到所述待分类投资项目的关键文本;The concatenation unit is used to concatenate the project names, main construction contents and industry fields of the investment projects to be classified to obtain the key text of the investment projects to be classified;

去除单元,用于将所述关键文本中的停用词进行去除得到字符组;A removal unit, configured to remove stop words in the key text to obtain a character group;

拼接单元,用于将所述字符组中的前n个词与标识符进行拼接,并将所述标识符置于首位,以形成所述关键文本对应的字符序列。A concatenating unit, configured to concatenate the first n words in the character group with identifiers, and put the identifiers in the first place, so as to form a character sequence corresponding to the key text.

较佳地,所述语义模块包括:Preferably, the semantic module includes:

转化单元,用于将所述项目信息词序列转化成unicode,并通过Unicode码位去除所述unicode中不合法字符及多余空格,得到信息词字符串;The conversion unit is used to convert the item information word sequence into unicode, and remove illegal characters and redundant spaces in the unicode through the Unicode code point to obtain the information word string;

循环单元,用于通过空格将所述信息词字符串中的中文字符进行分隔,并进行循环strip()操作,得到初始分词结果;The loop unit is used to separate the Chinese characters in the information word string by spaces, and perform a loop strip() operation to obtain the initial word segmentation result;

深处理单元,用于针对初始分词结果进行深处理得到目标分词结果;The deep processing unit is used to perform deep processing on the initial word segmentation result to obtain the target word segmentation result;

拆分单元,用于将所述目标分词结果中的英文按照预设拆分原则进行拆分,得到词向量矩阵。The splitting unit is configured to split the English in the target word segmentation result according to a preset splitting principle to obtain a word vector matrix.

较佳地,所述处理模块包括:Preferably, the processing module includes:

提取单元,用于利用卷积神经网络模型针对所述词向量矩阵进行局部特征提取,得到特征提取结果;The extraction unit is used to perform local feature extraction for the word vector matrix using a convolutional neural network model to obtain a feature extraction result;

池化单元,用于通过最大值池化操作将所述特征提取结果进行特征归一化,以选取局部最优特征得到池化结果。The pooling unit is used to perform feature normalization on the feature extraction result through a maximum pooling operation, so as to select a local optimal feature to obtain a pooling result.

较佳地,所述分类模块包括:Preferably, the classification module includes:

定义单元,用于采用多类交叉熵函数作为卷积神经网络模型的损失函数;Define a unit for adopting a multi-class cross-entropy function as a loss function of a convolutional neural network model;

分类单元,用于将所述整合特征通过卷积神经网络模型计算以输出对应的分类标签。The classification unit is used to calculate the integrated features through the convolutional neural network model to output corresponding classification labels.

第三方面,本申请实施例提供了一种电子设备,包括存储器、处理器以及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现如第一方面所述的项目标签预测方法。In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and operable on the processor. When the processor executes the computer program, Implement the item label prediction method as described in the first aspect.

第四方面,本申请实施例提供了一种存储介质,其上存储有计算机程序,该程序被处理器执行时实现如第一方面所述的项目标签预测方法。In a fourth aspect, an embodiment of the present application provides a storage medium, on which a computer program is stored, and when the program is executed by a processor, the method for predicting item tags as described in the first aspect is implemented.

相比现有技术,本发明的有益效果为:采用NLP中的通用语言模型BERT模型,能捕捉整个句子中文本信息的字符序列信息、上下文关系信息、语法语境信息等,解决了一词多义问题。且Bert模型使用转换器的编码器可并行执行运算,可叠加多层,对文本信息有很强的表征能力,其输出的词向量能够非常好的表征文本信息的特征,接着将其作为下游自然语言处理任务的模型参数或者模型输入以提高模型的整体性能。后接CNN层,利用不同大小的卷积核捕捉句子中不同长度词的信息,中文词语蕴含的信息往往比字更丰富,所以把词语信息从整句数据中提取出来后再进行分类,分类结果更理想,实现提升投资项目标签分类的准确度、全面性及效率性。Compared with the prior art, the beneficial effects of the present invention are: adopting the general language model BERT model in NLP, it can capture the character sequence information, context information, grammatical context information, etc. of the text information in the entire sentence, and solve the problem of multiple words question of justice. Moreover, the Bert model uses the encoder of the converter to perform operations in parallel, and can superimpose multiple layers. It has a strong ability to represent text information. The word vector output by it can very well represent the characteristics of text information, and then use it as a downstream natural Model parameters or model inputs for language processing tasks to improve the overall performance of the model. Followed by the CNN layer, convolution kernels of different sizes are used to capture the information of words of different lengths in the sentence. The information contained in Chinese words is often richer than that of words, so the word information is extracted from the entire sentence data and then classified. The classification results More ideally, to improve the accuracy, comprehensiveness and efficiency of investment project label classification.

附图说明Description of drawings

为了更清楚地说明本发明实施例中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the following will briefly introduce the accompanying drawings that need to be used in the descriptions of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are only of the present invention. For some embodiments, those of ordinary skill in the art can also obtain other drawings based on these drawings without paying creative efforts.

图1为本发明实施例1提供的项目标签预测方法的流程图;Fig. 1 is the flow chart of the item tag prediction method provided by Embodiment 1 of the present invention;

图2是本发明实施例2提供的与实施例1方法对应的项目标签预测系统结构框图;Fig. 2 is the structural block diagram of the item tag prediction system corresponding to the method of embodiment 1 provided by embodiment 2 of the present invention;

图3是本发明实施例3提供的电子设备的硬件结构示意图。FIG. 3 is a schematic diagram of a hardware structure of an electronic device provided by Embodiment 3 of the present invention.

附图标记说明:Explanation of reference signs:

10-表征模块、11-串接单元、12-去除单元、13-拼接单元;10-characterization module, 11-serial unit, 12-removal unit, 13-stitching unit;

20-映射模块;20 - mapping module;

30-语义模块、31-转化单元、32-循环单元、33-深处理单元、34-拆分单元;30-semantic module, 31-transformation unit, 32-circulation unit, 33-deep processing unit, 34-splitting unit;

40-处理模块、41-提取单元、42-池化单元;40-processing module, 41-extraction unit, 42-pooling unit;

50-变换模块;50-transformation module;

60-分类模块、61-定义单元、62-分类单元;60-taxonomic module, 61-definition unit, 62-taxonomic unit;

70-总线、71-处理器、72-存储器、73-通信接口。70-bus, 71-processor, 72-memory, 73-communication interface.

具体实施方式Detailed ways

现在将参考附图更全面地描述示例实施方式。然而,示例实施方式能够以多种形式实施,且不应被理解为限于在此阐述的范例;相反,提供这些实施方式使得本公开将更加全面和完整,并将示例实施方式的构思全面地传达给本领域的技术人员。Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of example embodiments to those skilled in the art.

此外,所描述的特征、结构或特性可以以任何合适的方式结合在一个或更多实施例中。在下面的描述中,提供许多具体细节从而给出对本公开的实施例的充分理解。然而,本领域技术人员将意识到,可以实践本公开的技术方案而没有特定细节中的一个或更多,或者可以采用其它的方法、组元、装置、步骤等。在其它情况下,不详细示出或描述公知方法、装置、实现或者操作以避免模糊本公开的各方面。Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided in order to give a thorough understanding of embodiments of the present disclosure. However, those skilled in the art will appreciate that the technical solutions of the present disclosure may be practiced without one or more of the specific details, or other methods, components, means, steps, etc. may be employed. In other instances, well-known methods, apparatus, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

附图中所示的方框图仅仅是功能实体,不一定必须与物理上独立的实体相对应。即,可以采用软件形式来实现这些功能实体,或在一个或多个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。The block diagrams shown in the drawings are merely functional entities and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices entity.

附图中所示的流程图仅是示例性说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解,而有的操作/步骤可以合并或部分合并,因此实际执行的顺序有可能根据实际情况改变。The flow charts shown in the drawings are only exemplary illustrations, and do not necessarily include all contents and operations/steps, nor must they be performed in the order described. For example, some operations/steps can be decomposed, and some operations/steps can be combined or partly combined, so the actual order of execution may be changed according to the actual situation.

随着软件技术的发展和普及,投资项目管理软件在投资项目行业都得到了深入应用,传统投资项目管理模式中,由于项目库分类智能化、精细化、可扩展性等方面不足,导致项目信息支撑能力不足。诸如:当提出了对区域投资新要求时,需要人工去新建标签,并同步更新项目全口径库数据,进而形成新的主题分析。目前对不同项目分类均由人工去判断,实际操作人员需要对大量的数据,进行分析处理。而且在处理的过程中会受限于人员限制,导致无法及时响应对项目分类的快速分析需求。本申请正是基于数据模型针对项目分类进行判断,减少人力投入,提高分析准确性,使得分析结果更具有科学性。With the development and popularization of software technology, investment project management software has been deeply applied in the investment project industry. In the traditional investment project management mode, due to the lack of intelligence, refinement, and scalability of the project library classification, the project information Insufficient support. For example: when a new requirement for regional investment is put forward, it is necessary to manually create a new label, and simultaneously update the project's full-caliber database data to form a new theme analysis. At present, the classification of different items is judged manually, and actual operators need to analyze and process a large amount of data. Moreover, the processing process will be limited by personnel restrictions, resulting in the inability to respond to the rapid analysis requirements for item classification in a timely manner. This application is based on the data model to judge the project classification, reduce manpower input, improve the accuracy of analysis, and make the analysis results more scientific.

实施例1Example 1

具体而言,图1所示为本实施例所提供的一种项目标签预测方法的流程示意图。Specifically, FIG. 1 is a schematic flowchart of a method for predicting an item label provided by this embodiment.

如图1所示,本实施例的项目标签预测方法包括以下步骤:As shown in Figure 1, the item label prediction method of the present embodiment includes the following steps:

S101,获取表征待分类投资项目的关键文本所对应的字符序列。S101. Obtain a character sequence corresponding to a key text characterizing an investment project to be classified.

具体地,根据自然语言处理的任务类型,将原始文本作为自然语言处理的输入数据,由于原始文本的离散度高,数据类型不能够在自然语言处理过程中直接被调用和处理,需要将所述原始文本的原始数据进行数据转化,转化为字符序列,再对字符序列进行分词处理。本实施例中,通过选用某一投资项目的项目名称、主要建设内容和行业领域的文本信息作为待分类投资项目的关键文本,是因为项目名称、主要建设内容和行业领域的文本信息是因为这三项是最能反映项目建设内容的文本字段。通过作为执行主体的编码器完成对原始数据的信息提取并去除停用词后,将文本前n个词与BERT既定标识符[CLS]进行拼接,其中[CLS]置于首位,用于表征待分类投资项目的关键文本所对应的字符序列。Specifically, according to the task type of natural language processing, the original text is used as the input data of natural language processing. Due to the high degree of dispersion of the original text, the data type cannot be directly called and processed in the process of natural language processing. The original data of the original text is converted into a character sequence, and then word segmentation is performed on the character sequence. In this embodiment, by selecting the project name, main construction content and text information of the industry field of a certain investment project as the key text of the investment project to be classified, it is because the text information of the project name, main construction content and industry field is because this The three items are the text fields that best reflect the construction content of the project. After completing the information extraction of the original data and removing the stop words through the encoder as the execution subject, the first n words of the text are spliced with the BERT established identifier [CLS], where [CLS] is placed in the first place and used to represent the The character sequence corresponding to the key text of the classified investment project.

进一步地,步骤S101的具体步骤包括:Further, the specific steps of step S101 include:

S1011,将待分类投资项目的项目名称、主要建设内容及行业领域进行串接,得到所述待分类投资项目的关键文本。S1011. Concatenate the project name, main construction content and industry field of the investment project to be classified to obtain the key text of the investment project to be classified.

具体地,投资项目是指在规定期限内为完成某项开发目标(或一组开发目标)而规划和实施的活动、机构以及其他各方面所构成的独立整体。针对待分类投资项目的文本,通常投资项目的整份文件具有一定的格式要求,诸如具有项目名称、主要建设内容、行业领域、可行性分析、财务预设等模块,为了最能反映投资项目的建设内容,本实施例选用投资项目的项目名称、主要建设内容和行业领域的文本信息作为选用信息,将该选用信息进行语句拼接,得到最能反映待分类投资项目的关键文本。Specifically, an investment project refers to an independent whole composed of activities, institutions and other aspects planned and implemented to accomplish a certain development goal (or a group of development goals) within a specified period. For the text of the investment project to be classified, usually the entire document of the investment project has certain format requirements, such as project name, main construction content, industry field, feasibility analysis, financial presupposition and other modules, in order to best reflect the investment project. Construction content, this embodiment selects the project name of the investment project, the main construction content and the text information of the industry field as the selection information, and splices the selection information to obtain the key text that best reflects the investment project to be classified.

S1012,将所述关键文本中的停用词进行去除得到字符组。S1012. Remove stop words in the key text to obtain a character group.

具体地,停用词是指在信息检索中,为节省存储空间和提高搜索效率,在处理自然语言数据(或文本)之前或之后会自动过滤掉某些字或词,这些字或词即被称为停用词。这些停用词都是人工输入、非自动化生成的,生成后的停用词会形成一个停用词表。当利用jieba进行中文分词时,主要是句子中出现的词语都会被划分,而有些词语是没有实际意思的,对于后续的关键词提取就会加大工作量,并且可能提取的关键词是无效的。所以在分词处理会引入停用词去除优化分词的结果。对于停用词,可以自己手动添加到一个txt文件中,然后在需要时导入文件,也可以利用已经整理好的停用词表。当然,在已有的停用词表基础上,如果还有一些词语不需要,也可以自己完善停用词表。Specifically, stop words mean that in information retrieval, in order to save storage space and improve search efficiency, some words or words are automatically filtered out before or after processing natural language data (or text), and these words or words are called called stop words. These stop words are all manually input and non-automatically generated, and the generated stop words will form a stop word list. When using jieba for Chinese word segmentation, the main words that appear in the sentence will be divided, and some words have no practical meaning, which will increase the workload for subsequent keyword extraction, and the keywords that may be extracted are invalid . Therefore, in word segmentation processing, stop words will be introduced to remove and optimize word segmentation results. For stop words, you can manually add them to a txt file, and then import the file when needed, or you can use the already organized stop words list. Of course, on the basis of the existing stop vocabulary list, if there are still some words that are not needed, you can also improve the stop vocabulary list yourself.

S1013,将所述字符组中的前n个词与标识符进行拼接,并将所述标识符置于首位,以形成所述关键文本对应的字符序列。S1013. Splice the first n words in the character group with identifiers, and place the identifiers at the first place, so as to form a character sequence corresponding to the key text.

具体地,本实施例的BERT模型的标识符具体为[CLS],[CLS]就是classification的意思,可理解为用于下游的分类任务。对于语句对分类任务,BERT模型除添加[CLS]符号并将对应的输出作为文本的语义表示,还对输入的两句话用一个[SEP]符号作分割,并分别对两句话附加两个不同的文本向量以作区分。本实施例中,将获取的字符组中前n个词与BERT标识符[CLS]进行拼接,并将[CLS]置于首位,形成关键文本对应的字符序列,用于标识整个输入的语义。Specifically, the identifier of the BERT model in this embodiment is specifically [CLS], and [CLS] means classification, which can be understood as being used for downstream classification tasks. For sentence pair classification tasks, the BERT model not only adds the [CLS] symbol and uses the corresponding output as the semantic representation of the text, but also divides the input two sentences with a [SEP] symbol, and attaches two sentences to the two sentences respectively. Different text vectors to differentiate. In this embodiment, the first n words in the obtained character group are spliced with the BERT identifier [CLS], and [CLS] is placed in the first place to form a character sequence corresponding to the key text, which is used to identify the semantics of the entire input.

S102,将所述字符序列通过映射方式转换成若干嵌入表示,并将若干嵌入表示叠加得到项目信息词序列;其中,所述嵌入表示包括字符嵌入、位置嵌入及句子类型嵌入。S102. Convert the character sequence into several embedded representations by means of mapping, and superimpose the several embedded representations to obtain an item information word sequence; wherein, the embedded representations include character embedding, position embedding, and sentence type embedding.

具体地,上下文特征提取层通过BiGRU、CNN和BiGCN编码器来提取句子的多粒度特征表示和区域特征表示,多特征嵌入层BiGRU能够融合前向和后向的信息,因此可以更好的提取特征信息。输入一个字符序列,上下文特征提取层的任务是充分捕获句子的语义特征信息。本实施例在字符嵌入的基础上添加了位置嵌入以及句子类型嵌入。Specifically, the contextual feature extraction layer uses BiGRU, CNN and BiGCN encoders to extract the multi-granularity feature representation and regional feature representation of sentences, and the multi-feature embedding layer BiGRU can fuse forward and backward information, so it can better extract features information. Input a character sequence, the task of the contextual feature extraction layer is to fully capture the semantic feature information of the sentence. This embodiment adds position embedding and sentence type embedding on the basis of character embedding.

其中,字符嵌入:CNN具有较强的局部特征提取能力,因此通过CNN提取每个单词的字符特征表示Hc,具体公式如下:Among them, character embedding: CNN has a strong local feature extraction ability, so the character feature representation Hc of each word is extracted through CNN, and the specific formula is as follows:

Hc=FCL(Maxpooling(Conv(c1,c2,...,cm)), Hc = FCL(Maxpooling(Conv(c1, c2, . . . , cm)),

其中,FCL()属于全连接操作,Maxpooling()属于最大池化操作,Conv()属于卷积操作,m是单词的长度。Among them, FCL() belongs to the full connection operation, Maxpooling() belongs to the maximum pooling operation, Conv() belongs to the convolution operation, and m is the length of the word.

其中,位置嵌入:在一个序列中,单词的出现位置能够为单词本身提供额外的语义信息。本文基于正余弦函数获得每个字符的位置信息,正弦函数和余弦函数所产生的向量值域在0~1之间,这有利于模型的训练和收敛。输入一个文本序列,经过正余弦函数变换得到位置嵌入Hf,具体方式如下:Among them, position embedding: in a sequence, the position of a word can provide additional semantic information for the word itself. In this paper, the position information of each character is obtained based on the sine and cosine functions. The value range of the vector generated by the sine and cosine functions is between 0 and 1, which is conducive to the training and convergence of the model. Input a text sequence, and get the position embedding H f through sine and cosine function transformation, the specific method is as follows:

其中,f是字符在序列中出现的位置,dp代表正余弦函数生成位置向量的维度。正余弦函数的维度和周期相互影响,因此它可以捕获单词之间的相对位置和绝对位置信息。Among them, f is the position where the character appears in the sequence, and d p represents the dimension of the position vector generated by the sine-cosine function. The dimension and period of the sine-cosine function influence each other, so it can capture relative and absolute position information between words.

S103,通过Bert语言模型处理所述项目信息词序列输出词向量矩阵。S103, process the project information word sequence through the Bert language model to output a word vector matrix.

具体地,Bert语言模型是完全意义上的双向语言模型,能捕捉整个句子中字序列信息、上下文关系信息、语法语境信息等,解决一词多义问题。词向量被用作下游模型的高质量特征输入。NLP模型(如LSTMs或CNNs)需要以数字向量的形式输入,意味着需要将词汇表和部分语音等特征转换为数字表示,这些特征嵌入是由Word2Vec或Fasttext等模型产生的。BERT与Word2Vec之类的模型相比提供一个优势,因为尽管Word2Vec下的每个单词都有一个固定的表示,而与单词出现的上下文无关,BERT生成的单词表示是由单词周围的单词动态通知的。Specifically, the Bert language model is a bidirectional language model in a complete sense, which can capture word sequence information, contextual relationship information, grammatical context information, etc. in the entire sentence to solve the problem of polysemy. Word vectors are used as high-quality feature inputs for downstream models. NLP models (such as LSTMs or CNNs) require input in the form of numerical vectors, meaning that features such as vocabulary and parts of speech need to be converted into numerical representations. These feature embeddings are produced by models such as Word2Vec or Fasttext. BERT offers an advantage over models like Word2Vec because, while under Word2Vec each word has a fixed representation independent of the context in which the word appears, BERT generates word representations that are dynamically informed by the words around it .

进一步地,步骤S103的具体步骤包括:Further, the specific steps of step S103 include:

S1031,将所述项目信息词序列转化成unicode,并通过Unicode码位去除所述unicode中不合法字符及多余空格,得到信息词字符串。S1031. Convert the item information word sequence into unicode, and remove illegal characters and redundant spaces in the unicode through Unicode code points to obtain an information word string.

具体地,unicode是为解决传统的字符编码方案的局限而产生的,它为每种语言中的每个字符设定了统一并且唯一的二进制编码,以满足跨语言、跨平台进行文本转换、处理的要求。本实施例中将项目信息词序列中的字符串或者字节转换成unicode目的在于方便操作,因为后续操作要判断中文、英文、特殊符号等等;并通过Unicode码位去除unicode中不合法字符及多余空格。Specifically, unicode was created to solve the limitations of traditional character encoding schemes. It sets a unified and unique binary encoding for each character in each language to meet the needs of cross-language and cross-platform text conversion and processing. requirements. In the present embodiment, the character string or byte in the project information word sequence are converted into unicode purpose to facilitate operation, because follow-up operations will judge Chinese, English, special symbols, etc.; and remove illegal characters and characters in unicode by Unicode code points extra spaces.

S1032,通过空格将所述信息词字符串中的中文字符进行分隔,并进行循环strip()操作,得到初始分词结果。S1032. Separate the Chinese characters in the information word string by spaces, and perform a loop strip() operation to obtain an initial word segmentation result.

具体地,strip()操作用于移除字符串头尾指定的字符或字符序列,只能删除开头或是结尾的字符,不能删除中间部分的字符,诸如包括空格、换行(\n)、制表符(\t)。本实施例中,对于text中的字符,首先判断其是不是中文字符,是的话在其前后加上一个空格,否则原样输出。经过这步后,中文被按字分开,用空格分隔,但英文数字等仍然保持原状。Specifically, the strip() operation is used to remove the characters or character sequences specified at the beginning and end of the string. Only the characters at the beginning or the end can be deleted, but the characters in the middle cannot be deleted, such as spaces, newlines (\n), and restrictions. Tabulator (\t). In this embodiment, for the characters in the text, it is first judged whether it is a Chinese character, if yes, a space is added before and after it, otherwise it is output as it is. After this step, Chinese characters are separated by characters and separated by spaces, but English numbers, etc. remain as they are.

S1033,针对初始分词结果进行深处理得到目标分词结果。S1033, performing deep processing on the initial word segmentation result to obtain a target word segmentation result.

具体地,首先对text进行strip()操作,去掉两边多余空白字符,然后如果剩下的是一个空字符串,则直接返回空列表,否则进行split()操作,得到最初的分词结果orig_tokens;分词得到的列表继续处理,将语料包含的变音符号去掉,还原成字符形式,或者根据标点符号分词。Specifically, first perform the strip() operation on the text to remove redundant blank characters on both sides, and then return an empty list if the remaining is an empty string, otherwise perform the split() operation to obtain the original word segmentation result orig_tokens; word segmentation The obtained list continues to be processed, and the diacritics contained in the corpus are removed and restored to character form, or word segmentation is performed according to punctuation marks.

S1034,将所述目标分词结果中的英文按照预设拆分原则进行拆分,得到词向量矩阵。其中,所述预设拆分原则具体为:将英文按照subword词表进行拆分,每个单词拆分后的subword尽可能地长,采用贪婪最长优先匹配算法,对于每个单词,指针i=0、j=len从后向前匹配,直至单词的前缀[i:j]是subword词表中的一个subword,则将其取出,进而设置i=j、j=len,循环上述流程。S1034. Split the English in the target word segmentation result according to a preset splitting principle to obtain a word vector matrix. Wherein, the preset splitting principle is specifically: split English according to the subword vocabulary, the subword after each word is split as long as possible, adopt the greedy longest-first matching algorithm, and for each word, the pointer i =0, j=len are matched from back to front until the prefix [i:j] of the word is a subword in the subword vocabulary, then it is taken out, and then i=j, j=len are set, and the above-mentioned process is circulated.

具体地,将目标分词结果进行进一步的WordPiece Tokenizer拆分,首先,对中文来说,字粒度已经是最小的不可拆的粒度了,没法再进行subword,subword基本上是对英文进行处理;将单词按照subword词表进行拆分,分词路径的确定过程可采用贪婪匹配,例如,尽可能匹配最长的字符串或者编码序列,获取分词路径;还可以采用非贪心匹配,例如,尽可能匹配最短的字符串或者编码序列,找到字符或者编码中所有的连续成词的分词路径,并根据所述分词路径确定分词片段,分词片段以字符集合或者编码集合的形式存在,将所述分词片段进行编码,以使所述分词片段进行向量化。本实施例中,采用贪婪最长优先匹配算法。Specifically, the target word segmentation results are further split by WordPiece Tokenizer. First of all, for Chinese, the word granularity is already the smallest inseparable granularity, and there is no way to perform subword. Subword basically processes English; Words are split according to the subword vocabulary, and the process of determining the word segmentation path can use greedy matching, for example, matching the longest string or code sequence as much as possible to obtain the word segmentation path; non-greedy matching can also be used, for example, matching the shortest possible character string or code sequence, find the word segmentation path of all consecutive words in the character or code, and determine the word segmentation fragment according to the word segmentation path, the word segmentation fragment exists in the form of a character set or a code set, and encode the word segmentation fragment , so that the word segmentation segment is vectorized. In this embodiment, a greedy longest-first matching algorithm is used.

S104,针对所述词向量矩阵进行局部特征提取,并将提取的局部特征归一化处理得到池化结果。S104, performing local feature extraction on the word vector matrix, and normalizing the extracted local features to obtain a pooling result.

具体地,由于词向量矩阵的维度较高且包含部分噪声,故引入多核卷积神经网络对其进行优化表示。卷积神经网络可通过滑动窗口机制对同一区域内的所有特征进行卷积变换从而有效保留词语的局部特征。考虑模型的训练时间和准确率,在实践中,将卷积核设置为窗口是3、4、5的混合卷积核,既可以保证模型较低的训练复杂度有拥有良好的分类效果,以词向量矩阵作为卷积层的输入,使用多个卷积核大小为(2,3,4)对其进行局部特征提取,得到对应的特征提取结果,之后通过Max Pooling池化操作选取局部最优特征。Specifically, since the dimension of the word vector matrix is high and contains some noise, a multi-core convolutional neural network is introduced to optimize its representation. The convolutional neural network can perform convolution transformation on all features in the same area through the sliding window mechanism to effectively preserve the local features of words. Considering the training time and accuracy of the model, in practice, setting the convolution kernel as a mixed convolution kernel with a window of 3, 4, and 5 can ensure that the model has a low training complexity and has a good classification effect. The word vector matrix is used as the input of the convolutional layer, and multiple convolution kernels with a size of (2, 3, 4) are used for local feature extraction to obtain the corresponding feature extraction results, and then the local optimum is selected through the Max Pooling operation feature.

进一步地,步骤S104的具体步骤包括:Further, the specific steps of step S104 include:

S1041,利用卷积神经网络模型针对所述词向量矩阵进行局部特征提取,得到特征提取结果。S1041. Use a convolutional neural network model to extract local features from the word vector matrix to obtain a feature extraction result.

具体地,图像的每一个像素点里都存储着图像的信息。定义一个卷积核(相当于权重),用来从图像中提取一定的特征,卷积核与数字矩阵对应位相乘再相加,得到卷积层输出结果。卷积核的取值在没有以往学习的经验下,可由函数随机生成,再逐步训练调整,提取图片每个小部分里具有的特征。Specifically, image information is stored in each pixel of the image. Define a convolution kernel (equivalent to a weight), which is used to extract certain features from the image. The convolution kernel is multiplied and added to the corresponding bits of the digital matrix to obtain the output result of the convolution layer. The value of the convolution kernel can be randomly generated by the function without previous learning experience, and then gradually trained and adjusted to extract the features of each small part of the picture.

S1042,通过最大值池化操作将所述特征提取结果进行特征归一化,以选取局部最优特征得到池化结果。S1042. Perform feature normalization on the feature extraction result through a maximum pooling operation, so as to select a local optimal feature to obtain a pooling result.

具体地,最大值池化操作采用max pooling方法,该方法是取一个区域的最大值。因此当图像发生平移、缩放、旋转等较小的变化时,依然很有可能在同一位置取到最大值,与变化前的响应相同,由此实现了仿射不变性。本实施例的池化目的为了减少训练参数的数量,降低卷积层输出的特征向量的维度;减小过拟合现象,只保留最有用的图片信息,减少噪声的传递。Specifically, the maximum pooling operation adopts the max pooling method, which is to take the maximum value of an area. Therefore, when the image undergoes minor changes such as translation, scaling, and rotation, it is still very likely to take the maximum value at the same position, which is the same as the response before the change, thus achieving affine invariance. The purpose of pooling in this embodiment is to reduce the number of training parameters, reduce the dimension of the feature vector output by the convolutional layer; reduce over-fitting phenomenon, retain only the most useful image information, and reduce the transmission of noise.

S105,采用全连接对拼接后的所述池化结果进行变换处理得到整合特征。S105, transforming the spliced pooling result by using full connection to obtain an integrated feature.

具体地,所述全连接应用dropout策略以使部分神经元的激活概率固定在p值上,其中,p值的取值范围为0~1;通过dropout策略可以使某些神经元的激活概率固定在p值上,使模型在向前传输过程中不会太依赖某些局部特征,使模型的鲁棒性更好,泛化能力更强。本实施例中,卷积层和池化层的工作就是提取特征,并减少原始图像带来的参数。为了生成最终的输出,需要应用全连接层来生成一个等于需要的类的数量的分类器,全连接层的工作原理和之前的神经网络学习很类似,需要把池化层输出的张量重新切割成一些向量,乘上权重矩阵,加上偏置值,然后对其使用ReLU激活函数,用梯度下降法优化参数既可。Specifically, the full connection applies a dropout strategy to fix the activation probability of some neurons on the p value, where the p value ranges from 0 to 1; the dropout strategy can make the activation probability of some neurons fixed In terms of p value, the model will not be too dependent on some local features during the forward transmission process, so that the robustness of the model is better and the generalization ability is stronger. In this embodiment, the job of the convolutional layer and the pooling layer is to extract features and reduce the parameters brought by the original image. In order to generate the final output, a fully connected layer needs to be applied to generate a classifier equal to the number of classes required. The working principle of the fully connected layer is very similar to the previous neural network learning, and the tensor output by the pooling layer needs to be re-cut. Form some vectors, multiply the weight matrix, add the bias value, and then use the ReLU activation function to optimize the parameters with the gradient descent method.

S106,通过softmax分类器针对所述整合特征进行学习得到所述待分类投资项目的分类标签。S106. Obtain classification labels of the investment items to be classified by learning the integrated features through a softmax classifier.

具体地,softmax函数在机器学习中是常用的多分类器,特别是在卷积神经网络中,最后的一层经常都是使用softmax分类器进行多类别分类任务。softmax函数是logistic函数的一般形式,是将分类问题转化为概率问题,就是求解统计所有可能的概率,然后概率最大的即认为为该类别。Specifically, the softmax function is a commonly used multi-classifier in machine learning, especially in convolutional neural networks, where the last layer often uses a softmax classifier for multi-category classification tasks. The softmax function is the general form of the logistic function, which converts the classification problem into a probability problem, that is, solves and counts all possible probabilities, and then the one with the highest probability is considered as the category.

进一步地,步骤S106的具体步骤包括:Further, the specific steps of step S106 include:

S1061,采用多类交叉熵函数作为卷积神经网络模型的损失函数;S1061, using a multi-class cross-entropy function as a loss function of the convolutional neural network model;

S1062,将所述整合特征通过卷积神经网络模型计算以输出对应的分类标签。S1062. Calculate the integrated feature through a convolutional neural network model to output a corresponding classification label.

综上所述,采用NLP中的通用语言模型BERT模型,能捕捉整个句子中文本信息的字符序列信息、上下文关系信息、语法语境信息等,解决了一词多义问题。且Bert模型使用转换器的编码器可并行执行运算,可叠加多层,对文本信息有很强的表征能力,其输出的词向量能够非常好的表征文本信息的特征,接着将其作为下游自然语言处理任务的模型参数或者模型输入以提高模型的整体性能。后接CNN层,利用不同大小的卷积核捕捉句子中不同长度词的信息,中文词语蕴含的信息往往比字更丰富,所以把词语信息从整句数据中提取出来后再进行分类,分类结果更理想,实现提升投资项目标签分类的准确度、全面性及效率性。To sum up, the BERT model, the general language model in NLP, can capture the character sequence information, contextual relationship information, grammatical context information, etc. of the text information in the entire sentence, and solve the polysemy problem of a word. Moreover, the Bert model uses the encoder of the converter to perform operations in parallel, and can superimpose multiple layers. It has a strong ability to represent text information. The word vector output by it can very well represent the characteristics of text information, and then use it as a downstream natural Model parameters or model inputs for language processing tasks to improve the overall performance of the model. Followed by the CNN layer, convolution kernels of different sizes are used to capture the information of words of different lengths in the sentence. The information contained in Chinese words is often richer than that of words, so the word information is extracted from the entire sentence data and then classified. The classification results More ideally, to improve the accuracy, comprehensiveness and efficiency of investment project label classification.

实施例2Example 2

本实施例提供了与实施例1所述方法相对应的系统的结构框图。图2是根据本实施例的项目标签预测系统的结构框图,如图2所示,该系统包括:This embodiment provides a structural block diagram of a system corresponding to the method described in Embodiment 1. Fig. 2 is a structural block diagram of the item label prediction system according to the present embodiment, as shown in Fig. 2, the system includes:

表征模块10,用于获取表征待分类投资项目的关键文本所对应的字符序列。The characterizing module 10 is configured to acquire character sequences corresponding to key texts representing investment items to be classified.

映射模块20,用于将所述字符序列通过映射方式转换成若干嵌入表示,并将若干嵌入表示叠加得到项目信息词序列;其中,所述嵌入表示包括字符嵌入、位置嵌入及句子类型嵌入。The mapping module 20 is used for converting the character sequence into several embedded representations through mapping, and superimposing the several embedded representations to obtain the project information word sequence; wherein, the embedded representations include character embedding, position embedding and sentence type embedding.

语义模块30,用于通过Bert语言模型处理所述项目信息词序列输出词向量矩阵。The semantic module 30 is configured to process the item information word sequence through the Bert language model and output a word vector matrix.

处理模块40,用于针对所述词向量矩阵进行局部特征提取,并将提取的局部特征归一化处理得到池化结果。The processing module 40 is configured to perform local feature extraction on the word vector matrix, and normalize the extracted local features to obtain a pooling result.

变换模块50,用于采用全连接对拼接后的所述池化结果进行变换处理得到整合特征;具体地,所述全连接应用dropout策略以使部分神经元的激活概率固定在p值上,其中,p值的取值范围为0~1。The transformation module 50 is used to transform the spliced pooling results by using the full connection to obtain integrated features; specifically, the full connection applies a dropout strategy to fix the activation probability of some neurons on the p value, where , and the p-value ranges from 0 to 1.

分类模块60,用于通过softmax分类器针对所述整合特征进行学习得到所述待分类投资项目的分类标签。The classification module 60 is configured to learn the integrated features through a softmax classifier to obtain the classification labels of the investment items to be classified.

较佳地,所述表征模块10包括:Preferably, the characterization module 10 includes:

串接单元11,用于将待分类投资项目的项目名称、主要建设内容及行业领域进行串接,得到所述待分类投资项目的关键文本;The concatenation unit 11 is used to concatenate the project names, main construction contents and industry fields of the investment projects to be classified to obtain the key text of the investment projects to be classified;

去除单元12,用于将所述关键文本中的停用词进行去除得到字符组;The removal unit 12 is used to remove the stop words in the key text to obtain the character group;

拼接单元13,用于将所述字符组中的前n个词与标识符进行拼接,并将所述标识符置于首位,以形成所述关键文本对应的字符序列。The concatenating unit 13 is configured to concatenate the first n words in the character group with identifiers, and place the identifiers at the first place, so as to form a character sequence corresponding to the key text.

较佳地,所述语义模块30包括:Preferably, the semantic module 30 includes:

转化单元31,用于将所述项目信息词序列转化成unicode,并通过Unicode码位去除所述unicode中不合法字符及多余空格,得到信息词字符串;The conversion unit 31 is used to convert the item information word sequence into unicode, and remove illegal characters and redundant spaces in the unicode by the Unicode code point to obtain the information word string;

循环单元32,用于通过空格将所述信息词字符串中的中文字符进行分隔,并进行循环strip()操作,得到初始分词结果;The loop unit 32 is used to separate the Chinese characters in the information word string by spaces, and perform a loop strip() operation to obtain the initial word segmentation result;

深处理单元33,用于针对初始分词结果进行深处理得到目标分词结果;The deep processing unit 33 is used for performing deep processing on the initial word segmentation result to obtain the target word segmentation result;

拆分单元34,用于将所述目标分词结果中的英文按照预设拆分原则进行拆分,得到词向量矩阵;其中,所述预设拆分原则具体为:将英文按照subword词表进行拆分,每个单词拆分后的subword尽可能地长,采用贪婪最长优先匹配算法,对于每个单词,指针i=0、j=len从后向前匹配,直至单词的前缀[i:j]是subword词表中的一个subword,则将其取出,进而设置i=j、j=len,循环上述流程。The splitting unit 34 is used to split the English in the target word segmentation result according to a preset splitting principle to obtain a word vector matrix; wherein, the preset splitting principle is specifically: carry out English according to the subword vocabulary Splitting, the subword after splitting each word is as long as possible, using the greedy longest-first matching algorithm, for each word, the pointer i=0, j=len are matched from back to front until the prefix of the word [i: j] is a subword in the subword vocabulary, then it is taken out, and then i=j, j=len are set, and the above-mentioned process is circulated.

较佳地,所述处理模块40包括:Preferably, the processing module 40 includes:

提取单元41,用于利用卷积神经网络模型针对所述词向量矩阵进行局部特征提取,得到特征提取结果;The extraction unit 41 is used to perform local feature extraction for the word vector matrix using a convolutional neural network model to obtain a feature extraction result;

池化单元42,用于通过最大值池化操作将所述特征提取结果进行特征归一化,以选取局部最优特征得到池化结果。The pooling unit 42 is configured to perform feature normalization on the feature extraction result through a maximum pooling operation, so as to select local optimal features to obtain a pooling result.

较佳地,所述分类模块60包括:Preferably, the classification module 60 includes:

定义单元61,用于采用多类交叉熵函数作为卷积神经网络模型的损失函数;Definition unit 61, for adopting multiclass cross-entropy function as the loss function of convolutional neural network model;

分类单元62,用于将所述整合特征通过卷积神经网络模型计算以输出对应的分类标签。The classification unit 62 is configured to calculate the integrated feature through a convolutional neural network model to output a corresponding classification label.

需要说明的是,上述各个模块可以是功能模块也可以是程序模块,既可以通过软件来实现,也可以通过硬件来实现。对于通过硬件来实现的模块而言,上述各个模块可以位于同一处理器中;或者上述各个模块还可以按照任意组合的形式分别位于不同的处理器中。It should be noted that each of the above-mentioned modules may be a function module or a program module, and may be realized by software or by hardware. For the modules implemented by hardware, the above modules may be located in the same processor; or the above modules may be located in different processors in any combination.

实施例3Example 3

结合图1所描述的项目标签预测方法可以由电子设备来实现。图3为根据本实施例的电子设备的硬件结构示意图。The item label prediction method described in conjunction with FIG. 1 can be implemented by electronic devices. FIG. 3 is a schematic diagram of a hardware structure of an electronic device according to this embodiment.

电子设备可以包括处理器71以及存储有计算机程序指令的存储器72。The electronic device may comprise a processor 71 and a memory 72 storing computer program instructions.

具体地,上述处理器71可以包括中央处理器(CPU),或者特定集成电路(Application Specific Integrated Circuit,简称为ASIC),或者可以被配置成实施本申请实施例的一个或多个集成电路。Specifically, the processor 71 may include a central processing unit (CPU), or an Application Specific Integrated Circuit (ASIC for short), or may be configured to implement one or more integrated circuits in the embodiments of the present application.

其中,存储器72可以包括用于数据或指令的大容量存储器。举例来说而非限制,存储器72可包括硬盘驱动器(Hard Disk Drive,简称为HDD)、软盘驱动器、固态驱动器(SolidState Drive,简称为SSD)、闪存、光盘、磁光盘、磁带或通用串行总线(Universal SerialBus,简称为USB)驱动器或者两个或更多个以上这些的组合。在合适的情况下,存储器72可包括可移除或不可移除(或固定)的介质。在合适的情况下,存储器72可在数据处理装置的内部或外部。在特定实施例中,存储器72是非易失性(Non-Volatile)存储器。在特定实施例中,存储器72包括只读存储器(Read-Only Memory,简称为ROM)和随机存取存储器(RandomAccess Memory,简称为RAM)。在合适的情况下,该ROM可以是掩模编程的ROM、可编程ROM(Programmable Read-Only Memory,简称为PROM)、可擦除PROM(Erasable ProgrammableRead-Only Memory,简称为EPROM)、电可擦除PROM(Electrically Erasable ProgrammableRead-Only Memory,简称为EEPROM)、电可改写ROM(Electrically Alterable Read-OnlyMemory,简称为EAROM)或闪存(FLASH)或者两个或更多个以上这些的组合。在合适的情况下,该RAM可以是静态随机存取存储器(Static Random-Access Memory,简称为SRAM)或动态随机存取存储器(Dynamic Random Access Memory,简称为DRAM),其中,DRAM可以是快速页模式动态随机存取存储器(Fast Page Mode Dynamic Random Access Memory,简称为FPMDRAM)、扩展数据输出动态随机存取存储器(Extended Date Out Dynamic RandomAccess Memory,简称为EDODRAM)、同步动态随机存取内存(Synchronous Dynamic Random-Access Memory,简称SDRAM)等。Wherein, the memory 72 may include a mass memory for data or instructions. For example without limitation, memory 72 may include a hard disk drive (Hard Disk Drive, referred to as HDD), a floppy disk drive, a solid state drive (SolidState Drive, referred to as SSD), flash memory, optical disc, magneto-optical disc, magnetic tape or universal serial bus (Universal Serial Bus, referred to as USB) drive or a combination of two or more of the above. Memory 72 may comprise removable or non-removable (or fixed) media, where appropriate. Memory 72 may be internal or external to the data processing arrangement, where appropriate. In a particular embodiment, memory 72 is a non-volatile (Non-Volatile) memory. In a specific embodiment, the memory 72 includes a read-only memory (Read-Only Memory, referred to as ROM) and a random access memory (Random Access Memory, referred to as RAM). In appropriate cases, the ROM can be mask programmed ROM, programmable ROM (Programmable Read-Only Memory, referred to as PROM), erasable PROM (Erasable Programmable Read-Only Memory, referred to as EPROM), electrically erasable In addition to PROM (Electrically Erasable Programmable Read-Only Memory, referred to as EEPROM), electrically rewritable ROM (Electrically Alterable Read-Only Memory, referred to as EAROM) or flash memory (FLASH) or a combination of two or more of these. Where appropriate, the RAM can be a Static Random-Access Memory (SRAM for short) or a Dynamic Random-Access Memory (DRAM for short), where the DRAM can be a fast page Mode Dynamic Random Access Memory (Fast Page Mode Dynamic Random Access Memory, referred to as FPMDRAM), Extended Data Output Dynamic Random Access Memory (Extended Date Out Dynamic Random Access Memory, referred to as EDODRAM), Synchronous Dynamic Random Access Memory (Synchronous Dynamic Random-Access Memory, referred to as SDRAM) and so on.

存储器72可以用来存储或者缓存需要处理和/或通信使用的各种数据文件,以及处理器71所执行的可能的计算机程序指令。The memory 72 may be used to store or cache various data files required for processing and/or communication, as well as possible computer program instructions executed by the processor 71 .

处理器71通过读取并执行存储器72中存储的计算机程序指令,以实现上述实施例1的项目标签预测方法。The processor 71 reads and executes the computer program instructions stored in the memory 72 to implement the item label prediction method of the first embodiment above.

在其中一些实施例中,电子设备还可包括通信接口73和总线70。其中,如图3所示,处理器71、存储器72、通信接口73通过总线70连接并完成相互间的通信。In some of these embodiments, the electronic device may further include a communication interface 73 and a bus 70 . Wherein, as shown in FIG. 3 , the processor 71 , the memory 72 , and the communication interface 73 are connected through the bus 70 and complete mutual communication.

The communication interface 73 is used to implement communication between the modules, apparatuses, units, and/or devices in the embodiments of the present application. The communication interface 73 may also implement data communication with other components, such as external devices, image/data acquisition equipment, databases, external storage, and image/data processing workstations.

The bus 70 includes hardware, software, or both, and couples the components of the device to one another. The bus 70 includes, but is not limited to, at least one of the following: a data bus, an address bus, a control bus, an expansion bus, or a local bus. By way of example and not limitation, the bus 70 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Extended Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. Where appropriate, the bus 70 may include one or more buses. Although the embodiments of this application describe and illustrate a particular bus, this application contemplates any suitable bus or interconnect.

The electronic device can load the project label prediction system and execute the project label prediction method of Embodiment 1.

In addition, in combination with the project label prediction method of Embodiment 1 above, an embodiment of the present application may provide a storage medium for implementation. Computer program instructions are stored on the storage medium; when the computer program instructions are executed by a processor, the project label prediction method of Embodiment 1 above is implemented.

The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The above descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A project label prediction method, characterized in that it comprises:
   obtaining a character sequence corresponding to key text characterizing an investment project to be classified;
   converting the character sequence into several embedding representations by mapping, and superimposing the embedding representations to obtain a project information word sequence, wherein the embedding representations include character embeddings, position embeddings, and sentence-type embeddings;
   processing the project information word sequence with a BERT language model to output a word vector matrix;
   performing local feature extraction on the word vector matrix, and normalizing the extracted local features to obtain pooling results;
   transforming the concatenated pooling results through a fully connected layer to obtain integrated features; and
   learning from the integrated features with a softmax classifier to obtain a classification label for the investment project to be classified.

2. The project label prediction method according to claim 1, wherein the step of obtaining the character sequence corresponding to the key text characterizing the investment project to be classified specifically comprises:
   concatenating the project name, main construction content, and industry field of the investment project to be classified to obtain the key text of the investment project to be classified;
   removing stop words from the key text to obtain a character group; and
   splicing the first n words of the character group with an identifier, with the identifier placed first, to form the character sequence corresponding to the key text.

3. The project label prediction method according to claim 1, wherein the step of processing the project information word sequence with the BERT language model to output the word vector matrix specifically comprises:
   converting the project information word sequence into Unicode, and removing illegal characters and redundant spaces by Unicode code point to obtain an information word string;
   separating the Chinese characters in the information word string with spaces, and performing a cyclic strip() operation to obtain an initial word segmentation result;
   performing deep processing on the initial word segmentation result to obtain a target word segmentation result; and
   splitting the English in the target word segmentation result according to a preset splitting principle to obtain the word vector matrix.

4. The project label prediction method according to claim 3, wherein the preset splitting principle is specifically:
   splitting English words according to a subword vocabulary, making the subwords of each split word as long as possible, using a greedy longest-match-first algorithm: for each word, pointers i = 0 and j = len match from back to front until the prefix [i:j] of the word is a subword in the subword vocabulary; that subword is then extracted, i = j and j = len are reset, and the above process is repeated.

5. The project label prediction method according to claim 1, wherein the step of performing local feature extraction on the word vector matrix and normalizing the extracted local features to obtain the pooling results specifically comprises:
   performing local feature extraction on the word vector matrix with a convolutional neural network model to obtain feature extraction results; and
   performing feature normalization on the feature extraction results through a max-pooling operation, so as to select locally optimal features and obtain the pooling results.

6. The project label prediction method according to claim 1, wherein the fully connected layer applies a dropout strategy so that the activation probability of some neurons is fixed at a value p, where p ranges from 0 to 1.

7. The project label prediction method according to claim 1, wherein the specific steps of learning from the integrated features with the softmax classifier to obtain the classification label of the investment project to be classified comprise:
   using a multi-class cross-entropy function as the loss function of the convolutional neural network model; and
   computing the integrated features through the convolutional neural network model to output the corresponding classification label.

8. A project label prediction system, characterized in that it comprises:
   a characterization module, configured to obtain a character sequence corresponding to key text characterizing an investment project to be classified;
   a mapping module, configured to convert the character sequence into several embedding representations by mapping, and superimpose the embedding representations to obtain a project information word sequence, wherein the embedding representations include character embeddings, position embeddings, and sentence-type embeddings;
   a semantics module, configured to process the project information word sequence with a BERT language model to output a word vector matrix;
   a processing module, configured to perform local feature extraction on the word vector matrix and normalize the extracted local features to obtain pooling results;
   a transformation module, configured to transform the concatenated pooling results through a fully connected layer to obtain integrated features; and
   a classification module, configured to learn from the integrated features with a softmax classifier to obtain a classification label for the investment project to be classified.

9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when executing the computer program, the processor implements the project label prediction method according to any one of claims 1 to 7.

10. A storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the project label prediction method according to any one of claims 1 to 7 is implemented.
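By way of non-limiting illustration, the key-text preparation of claim 2 might be sketched in Python as follows; the function name, the choice of [CLS] as the leading identifier, and the default n = 510 are assumptions of this sketch rather than part of the claim.

    def build_char_sequence(name: str, content: str, industry: str,
                            stopwords: set, n: int = 510) -> list:
        """Sketch of claim 2: concatenate the key fields, drop stop
        words, keep the first n tokens, and place the identifier first."""
        key_text = name + content + industry                   # key text
        chars = [c for c in key_text if c not in stopwords]    # remove stop words
        return ["[CLS]"] + chars[:n]                           # identifier placed first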
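The greedy longest-match-first splitting of claim 4 corresponds to the WordPiece scheme used by common BERT tokenizers; a minimal sketch follows, in which the in-memory vocabulary set and the "##" continuation prefix are assumptions drawn from those tokenizers, not from the claim text.

    def split_word(word: str, vocab: set) -> list:
        """Sketch of claim 4: greedy longest-prefix matching against a
        subword vocabulary, with pointers i and j scanning back to front."""
        pieces, i = [], 0
        while i < len(word):
            j = len(word)
            match = None
            while i < j:                   # shrink the candidate from the right
                candidate = word[i:j] if i == 0 else "##" + word[i:j]
                if candidate in vocab:     # prefix [i:j] is in the vocabulary
                    match = candidate
                    break
                j -= 1
            if match is None:              # no prefix matched at all
                return ["[UNK]"]
            pieces.append(match)           # take the matched subword out
            i = j                          # set i = j, restart with j = len
        return pieces

Given a suitable vocabulary, split_word("embedding", vocab) might yield ["em", "##bed", "##ding"]; making each subword as long as possible keeps rare words decomposable without exploding the vocabulary.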
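The overall pipeline of claims 1 and 5 to 7 — BERT word vectors, convolutional local features, max pooling, a dropout-regularized fully connected layer, and a softmax classifier trained with multi-class cross-entropy — can be sketched with PyTorch and the Hugging Face transformers library; the pretrained model name, kernel sizes, filter count, and p = 0.5 are assumptions for the example, not values fixed by the claims.

    import torch
    import torch.nn as nn
    from transformers import BertModel  # assumed dependency

    class BertCnnLabeler(nn.Module):
        """Sketch of claims 1, 5-7: BERT -> CNN -> max pool -> FC -> softmax."""
        def __init__(self, num_labels: int, kernel_sizes=(2, 3, 4),
                     num_filters: int = 128, p: float = 0.5):
            super().__init__()
            # BERT internally sums character, position and sentence-type embeddings
            self.bert = BertModel.from_pretrained("bert-base-chinese")
            hidden = self.bert.config.hidden_size
            # claim 5: convolutional local feature extraction at several widths
            self.convs = nn.ModuleList(
                nn.Conv1d(hidden, num_filters, k) for k in kernel_sizes)
            self.dropout = nn.Dropout(p)  # claim 6: dropout before the FC layer
            self.fc = nn.Linear(num_filters * len(kernel_sizes), num_labels)

        def forward(self, input_ids, attention_mask, token_type_ids):
            word_vectors = self.bert(input_ids, attention_mask=attention_mask,
                                     token_type_ids=token_type_ids).last_hidden_state
            x = word_vectors.transpose(1, 2)            # (batch, hidden, seq_len)
            # max pooling selects the locally optimal feature per filter
            pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
            features = self.dropout(torch.cat(pooled, dim=1))  # concatenation
            return self.fc(features)   # logits; softmax applied by the loss below

    # claim 7: multi-class cross-entropy as the loss function
    criterion = nn.CrossEntropyLoss()

Returning raw logits and letting CrossEntropyLoss apply the softmax is the numerically stable idiom in PyTorch and matches the claim's pairing of a softmax classifier with a multi-class cross-entropy loss.
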
CN202211706930.XA 2022-12-29 2022-12-29 Project label prediction method, system, electronic device and storage medium Pending CN115964490A (en)

Priority Applications (1)

Application Number: CN202211706930.XA — Priority/Filing Date: 2022-12-29 — Title: Project label prediction method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115964490A (en) 2023-04-14

Family ID: 87352564

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297374A (en) * 2021-04-29 2021-08-24 军事科学院系统工程研究院网络信息研究所 Text classification method based on BERT and word feature fusion
CN114064888A (en) * 2021-10-09 2022-02-18 暨南大学 Financial text classification method and system based on BERT-CNN
CN114265937A (en) * 2021-12-24 2022-04-01 中国电力科学研究院有限公司 Intelligent classification analysis method and system of scientific and technological information, storage medium and server

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861986A (en) * 2023-06-30 2023-10-10 平安科技(深圳)有限公司 Method, device, equipment and medium for updating model based on storage nerve unit

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination