CN114911929A

CN114911929A - Classification model training method, text mining method, equipment and storage medium

Info

Publication number: CN114911929A
Application number: CN202210372329.5A
Authority: CN
Inventors: 陈志优; 李健; 陈明; 武卫东
Original assignee: Beijing Sinovoice Technology Co Ltd
Current assignee: Beijing Sinovoice Technology Co Ltd
Priority date: 2022-04-11
Filing date: 2022-04-11
Publication date: 2022-08-16

Abstract

The invention discloses a classification model training method, a text mining method, equipment and a storage medium, and relates to the field of computer technology. The training method uses a plurality of acquired dialogue texts as training data and trains a classification model on that data to obtain a stage-trained classification model. The training data is screened by cluster analysis and labeled with scene categories, which greatly reduces the amount of labeling required. Difference information determined from the training data is then used to judge whether the stage-trained classification model needs further training. When the difference information meets the scene mining condition, continued training mines additional scene categories, so that text mining can surface more finely divided scene categories and relevant texts with little information content, which facilitates statistical analysis of the relevant dialogue texts.

Description

Classification model training method, text mining method, equipment and storage medium

Technical Field

The present invention relates to the field of computer technology, and in particular to a classification model training method, a text mining method, a device and a storage medium.

Background

In some online business scenarios, a large amount of consultation data is retained when business personnel provide consultation to customers. For less experienced business personnel, the professional phrasing reflected in that data is worth learning from. In the prior art, a classification model is commonly used to perform statistical analysis on consultation data so that learning can be organized by scene category. However, a classification model requires a large amount of labeled data during training and, once its scene categories are fixed, lacks the ability to discover new scenes.

Summary of the Invention

In view of the above problems, the present invention is proposed to provide a classification model training method, a text mining method, a device and a storage medium that overcome the above problems or at least partially solve them.

According to a first aspect of the present invention, a classification model training method is provided, the method comprising:

acquiring a plurality of dialogue texts as training data;

training a classification model according to the training data to obtain a stage-trained classification model;

determining difference information based on the training data, and judging according to the difference information whether the stage-trained classification model meets a scene mining condition;

if the difference information meets the scene mining condition, updating the training data according to the training data and the stage-trained classification model, and continuing training;

if the difference information does not meet the scene mining condition, taking the stage-trained classification model as the trained classification model;

wherein training the classification model according to the training data to obtain the stage-trained classification model comprises:

performing cluster analysis on the training data to determine a plurality of training clusters;

determining the scene category corresponding to each training cluster, and labeling the texts in the training cluster with that scene category;

training the classification model with the labeled text data to obtain the stage-trained classification model.

According to a second aspect of the present invention, a text mining method is also provided, the method comprising:

receiving dialogue information, and obtaining the dialogue text of a first user from the dialogue information;

inputting the dialogue text into a classification model for classification and recognition to determine the corresponding target scene category, wherein the classification model is obtained by performing training with training data to obtain a stage-trained classification model, determining difference information based on the training data, judging whether the stage-trained classification model meets a scene mining condition, and determining according to the judgment result whether to update the training data and continue training the stage-trained classification model;

querying the target reply text corresponding to the target scene category;

using the target reply text as the dialogue text of a second user to reply to the dialogue text of the first user.
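The steps of the second aspect can be sketched as a classify-then-look-up routine. This is a minimal illustration only: `suggest_reply`, the `classify` callable and the `reply_by_scene` mapping are hypothetical names, and the trained classification model is represented by whatever callable the caller passes in.

```python
def suggest_reply(dialogue_text, classify, reply_by_scene, default_reply=None):
    """Classify the first user's text and look up a reply for the second user.

    classify       -- hypothetical callable standing in for the trained
                      classification model; maps a text to a scene category.
    reply_by_scene -- hypothetical mapping from scene category to reply text.
    """
    # Step 1: determine the target scene category of the dialogue text.
    target_scene = classify(dialogue_text)
    # Step 2: query the target reply text for that scene category.
    target_reply = reply_by_scene.get(target_scene, default_reply)
    return target_scene, target_reply
```

Passing the classifier as a parameter keeps the lookup independent of how the model was trained; the returned reply would then be offered as the second user's dialogue text.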

According to a third aspect of the present invention, an electronic device is also provided, comprising:

one or more processors;

a memory;

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to perform any of the methods described above.

According to a fourth aspect of the present invention, a computer-readable storage medium is also provided, which stores a computer program used in conjunction with an electronic device, the computer program being executable by a processor to perform any of the methods described above.

In the solution of the present invention, a plurality of acquired dialogue texts are used as training data, and the classification model is trained according to the training data to obtain a stage-trained classification model. The training data is screened by cluster analysis and labeled with scene categories, which greatly reduces the amount of labeling required. Difference information determined from the training data is used to judge whether the stage-trained classification model needs further training. When the difference information meets the scene mining condition, more scene categories can be mined by continuing training. As a result, text mining can surface more finely divided scene categories and relevant texts with little information content, which facilitates statistical analysis of the relevant dialogue texts.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Brief Description of the Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same components.

In the drawings:

FIG. 1 is a flowchart of the steps of a classification model training method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of the steps of another classification model training method provided by an embodiment of the present invention;

FIG. 3 is a flowchart of the steps of a text mining method provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of the display content of a display page of a first client provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of one display content of a display page of the first client provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of another display content of a display page of the first client provided by an embodiment of the present invention;

FIG. 7 is a schematic diagram of one display content of a display page of a client provided by an embodiment of the present invention;

FIG. 8 is a schematic diagram of another display content of a display page of a client provided by an embodiment of the present invention;

FIG. 9 is a block diagram of a classification model training apparatus provided by an embodiment of the present invention;

FIG. 10 is a block diagram of a text data mining apparatus provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the invention will be understood more thoroughly and so that its scope is fully conveyed to those skilled in the art.

Referring to FIG. 1, a classification model training method provided by an embodiment of the present invention is shown. The method may include:

S101. Acquire a plurality of dialogue texts as training data.

S102. Train a classification model according to the training data to obtain a stage-trained classification model.

In the embodiment of the present invention, in some voice or text dialogue scenarios, the dialogue records produced during the dialogue are stored, subject to the customer's authorization. A plurality of dialogue texts can then be determined from the collected dialogue records. The customer may be treated as the first user and the business person as the second user. If a dialogue record is stored as voice data, it can be converted from speech to text to obtain the corresponding dialogue text. Once a number of dialogue texts have been obtained, they are used as training data to train a preset classification model until the model meets a first training condition; training then stops, and the model at that point is taken as the stage-trained classification model. In one example, the first training condition may be that the loss function of the classification model no longer decreases, or decreases only slightly.
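The example first training condition (the loss no longer decreases, or decreases only slightly) can be sketched as a plateau check over the recorded losses. The function name and the `min_delta` and `patience` parameters are assumptions made for illustration; the source gives no numeric definition of a small decrease.

```python
def loss_has_plateaued(losses, min_delta=1e-3, patience=3):
    """Return True when the loss has stopped improving.

    losses    -- per-epoch loss values, oldest first.
    min_delta -- assumed minimum improvement that still counts as "decreasing".
    patience  -- assumed number of most recent epochs to inspect.
    """
    if len(losses) <= patience:
        return False  # not enough history to judge
    best_before = min(losses[:-patience])   # best loss before the recent window
    recent_best = min(losses[-patience:])   # best loss inside the recent window
    # Improvement below min_delta (or a rising loss) counts as a plateau.
    return best_before - recent_best < min_delta
```

A training loop would call this after each epoch and stop the current stage once it returns True.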

Training the classification model according to the training data to obtain the stage-trained classification model includes the following steps:

Cluster analysis is performed on the training data to determine a plurality of training clusters.

The scene category corresponding to each training cluster is determined, and the texts in the training cluster are labeled with that scene category.

In the embodiment of the present invention, cluster analysis can be performed on the training data to determine a plurality of training clusters, and the corresponding scene categories are determined from those clusters. Each scene category may be determined according to the semantic information of the texts in the corresponding training cluster. After the scene category of a training cluster is determined, the texts in that cluster can be labeled with it. Labeling each training cluster with a scene category in this way produces labeled text data without labeling every dialogue text individually, which greatly reduces the amount of labeling required. The labeled text data is then used to train the preset classification model to obtain the stage-trained classification model.
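The labeling saving described above comes from assigning one scene category per cluster rather than per text. A minimal sketch, assuming cluster indices have already been produced by some clustering step and that `cluster_to_scene` (a hypothetical name) holds the category chosen for each cluster:

```python
def label_by_cluster(texts, cluster_ids, cluster_to_scene):
    """Propagate each cluster's scene category to every text in that cluster.

    texts            -- dialogue texts.
    cluster_ids      -- cluster index of each text, in the same order.
    cluster_to_scene -- hypothetical mapping: cluster index -> scene category.
    Texts whose cluster has no category are left out (an assumption of this
    sketch, not a requirement stated in the source).
    """
    labeled = []
    for text, cluster_id in zip(texts, cluster_ids):
        scene = cluster_to_scene.get(cluster_id)
        if scene is not None:
            labeled.append((text, scene))
    return labeled
```

With two labeled clusters, only two labeling decisions are needed regardless of how many texts the clusters contain.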

S103. Determine difference information based on the training data, and judge according to the difference information whether the stage-trained classification model meets the scene mining condition.

S104. If the difference information meets the scene mining condition, update the training data according to the training data and the stage-trained classification model, and continue training.

S105. If the difference information does not meet the scene mining condition, take the stage-trained classification model as the trained classification model.

In the embodiment of the present invention, after the scene category corresponding to each training cluster is determined, the difference information of the text data within the same training cluster can be determined. The difference information can be understood as the semantic similarity between the semantic information of different texts. Based on the difference information, it can be judged whether the stage-trained classification model meets the scene mining condition.

In one example, the scene mining condition is used to evaluate whether the scene categories of the stage-trained classification model need to be refined. For example, when the semantic similarity between different texts in the same training cluster is high, the boundary between the two texts is unclear and the classification effect is not pronounced. A semantic threshold can therefore be preset. When the semantic similarity is less than the semantic threshold, the difference information is determined not to meet the scene mining condition, that is, the classification effect is pronounced, and the stage-trained classification model is taken directly as the trained classification model. When the semantic similarity is equal to or greater than the semantic threshold, the difference information is determined to meet the scene mining condition; the training data is then updated and used to continue training the stage-trained classification model until that model meets the first training condition, at which point training stops and the resulting model is taken as the trained classification model.
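The threshold rule in this example can be sketched as follows. The source does not say how semantic similarity is computed or aggregated, so this sketch assumes cosine similarity between text vectors, averaged over all pairs in a cluster, and assumes the condition is met as soon as any one cluster reaches the threshold. All function names are illustrative.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all pairs of texts in one cluster."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def meets_scene_mining_condition(clusters, semantic_threshold):
    """True if some cluster's similarity is >= the threshold (continue training)."""
    return any(
        mean_pairwise_similarity(vectors) >= semantic_threshold
        for vectors in clusters.values()
    )
```

A result of True corresponds to step S104 (update the training data and continue training); False corresponds to step S105.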

In summary, in the classification model training method provided by the embodiment of the present invention, a plurality of acquired dialogue texts are used as training data, and the classification model is trained according to the training data to obtain a stage-trained classification model. The training data is screened by cluster analysis and labeled with scene categories, which greatly reduces the amount of labeling required. Difference information determined from the training data is used to judge whether the stage-trained classification model needs further training. When the difference information meets the scene mining condition, more scene categories can be mined by continuing training, so that text mining can surface more finely divided scene categories and relevant texts with little information content, which facilitates statistical analysis of the relevant dialogue texts.

Referring to FIG. 2, another classification model training method provided by an embodiment of the present invention is shown. The method may include:

S201. Acquire a plurality of dialogue texts as training data.

S202. Train a classification model according to the training data to obtain a stage-trained classification model.

In the embodiment of the present invention, once a number of dialogue texts have been obtained, they are used as training data to train a preset classification model until the model meets the first training condition; training then stops, and the model at that point is taken as the stage-trained classification model.

Cluster analysis is performed on the training data to determine a plurality of training clusters, and the corresponding scene categories are determined from those clusters. A first clustering model may be preset to perform a first cluster analysis on the training data. For example, the first clustering model may use the K-means (K-means clustering) algorithm, which, for a given number of clusters, groups dialogue texts that are close to one another into the same cluster. Accordingly, when performing statistical analysis of scene categories on multiple dialogue texts, the number of clusters m1 output by the first clustering model on its first run can be determined in advance. The number of clusters m1 can be determined from the actual number of scene categories n1 and is not limited here, except that m1 must be greater than n1. The number of clusters m may be the number of scene categories n plus a preset threshold, and the preset threshold may be set in the range of 3 to 10. For example, when the initial business scenarios are known to include a sales scenario and an after-sales scenario, the corresponding number of scene categories n1 is 2, and the number of clusters m1 may be 3, 4, 5 or 12, etc., which prevents the clustering result from being too divergent.
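The choice of cluster count and the K-means step can be sketched in plain Python. This is a textbook K-means on numeric vectors, offered as an illustration rather than the implementation used in the embodiment; `cluster_count`, the fixed iteration count and the random seed are assumptions of the sketch.

```python
import random


def cluster_count(num_scene_categories, preset_threshold):
    """Number of clusters m = n + preset threshold, which guarantees m > n."""
    if preset_threshold <= 0:
        raise ValueError("preset_threshold must be positive so that m > n")
    return num_scene_categories + preset_threshold


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means; returns one cluster index per input point."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

With n1 = 2 and a preset threshold of 3, `cluster_count` gives 5 clusters; the vectors passed to `kmeans` would be the word-vector or TF-IDF representations of the dialogue texts.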

After the first cluster analysis by the first clustering model, m1 training clusters are determined. In one example, after the training clusters are determined, a display page can be provided to show them, so that clusters with a clear scene category can be selected from among them by triggering a screening control and then labeled by category according to the preset business scenarios. For example, 2 training clusters with clear business scenarios are selected from 3 training clusters and labeled separately. For category labeling, an editing control may be provided on the display page so that the user can define the corresponding scene category. The category labeling of the training clusters is completed by triggering the editing control. For example, after category labeling is completed, text data corresponding to the sales scene and text data corresponding to the after-sales scene are obtained.

Labeling each training cluster with a scene category turns the texts in the labeled clusters into labeled text data. This avoids having to label every dialogue text during model training and greatly reduces the amount of labeling required. The labeled text data is then used to train the preset classification model to obtain the stage-trained classification model M1. The classification model may be a BERT-base model including N hidden layers. When the classification model is trained for the first time, the text features output by the N-th hidden layer can be used for classification, with n1 scene categories; the classification model obtained at this point is the stage-trained classification model M1.

In an optional embodiment of the invention, in the process of inputting the training data into the first clustering model for the first cluster analysis and outputting a plurality of training clusters, each dialogue text in the training data can first be segmented into words to determine its text tokens. Cluster analysis is then performed on the semantic information of the tokens of each dialogue text. In some dialogue texts, grouping the semantic information of certain tokens into a common class improves the clustering result. For example, a user asking "how long does express delivery to Beijing take" and one asking "how long does express delivery to Shanghai take" have the same intent, and both Shanghai and Beijing can be classed as addresses. Therefore, before semantic analysis, named entity recognition is performed on the tokens to determine the target named entities that need to be grouped. The target named entities include at least one of the following: person names, organization names and place names. After a target named entity is determined, it can be replaced with a corresponding target keyword, which may be a word associated with the target named entity.
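The replacement of target named entities by a keyword can be illustrated with the Beijing/Shanghai example. A real system would use a named entity recognition model; in this sketch a small hand-written set of place names (`PLACE_NAMES`, hypothetical) stands in for the recognizer, and `<ADDRESS>` is an assumed target keyword.

```python
# Hypothetical gazetteer standing in for a named entity recognition model.
PLACE_NAMES = {"Beijing", "Shanghai"}


def replace_named_entities(tokens, entities, keyword):
    """Replace every token found in `entities` with the target keyword."""
    return [keyword if token in entities else token for token in tokens]
```

After replacement, the two example questions become identical token sequences, so they fall into the same cluster regardless of the city mentioned.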

After all target named entities have been replaced, the resulting tokens are converted into text word vectors or text TF-IDF (term frequency-inverse document frequency) values. The word vectors or TF-IDF values are then input into the first clustering model to perform the cluster analysis described above.
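The conversion to TF-IDF values can be sketched as below. The source does not specify a weighting variant, so this sketch assumes relative term frequency multiplied by the unsmoothed inverse document frequency log(N / df); the function name is illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn tokenised documents into TF-IDF vectors over a shared vocabulary.

    documents -- list of token lists (after entity replacement).
    Returns (vocabulary, vectors), where vectors[i][j] is the weight of
    vocabulary[j] in documents[i].
    """
    vocabulary = sorted({token for doc in documents for token in doc})
    n_docs = len(documents)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([
            (counts[token] / len(doc)) * math.log(n_docs / doc_freq[token])
            if doc else 0.0
            for token in vocabulary
        ])
    return vocabulary, vectors
```

Tokens that occur in every document receive weight zero under this variant, so shared filler words do not influence the distance between texts.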

S203. Determine difference information based on the training data.

S204. Provide a display page to show the difference information, and acquire mining operation information based on the display page.

S205. Determine, according to the mining operation information, whether the stage-trained classification model meets the scene mining condition.

In one example, a display page can be provided on which the difference information is shown and on which a first selection control is set; the corresponding mining operation information is acquired from the display page when the first selection control is triggered. For example, the mining operation information may be yes or no. When it is yes, the stage-trained classification model is determined to meet the scene mining condition and step S206 is executed; when it is no, the model is determined not to meet the scene mining condition and step S208 is executed.

S206. Input the labeled text data into the stage-trained classification model and perform feature extraction to obtain the corresponding feature text.

S207. Use the feature text as training data.

In the embodiment of the present invention, the labeled text data is input into the stage-trained classification model for feature extraction. Feature extraction here can be understood as taking the output of a hidden layer of the stage-trained classification model directly as the corresponding feature text. In one example, at the first feature extraction, the output features of the 1st hidden layer are determined as the corresponding feature text, and the feature text is used as new training data. Considering that hidden layers closer to the output layer produce more accurate text features, at the first feature extraction the text features output by the (N-1)-th hidden layer may instead be determined as the feature text, and the training data is updated according to that feature text. Step S202 is then performed to continue training the stage-trained classification model with the updated training data.
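Taking a hidden layer's output as the feature text can be illustrated with a toy stack of layers. The layers here are plain Python callables standing in for the hidden layers of the classification model (the embodiment mentions BERT-base); the function names and the 1-based `layer_index` convention are assumptions of the sketch.

```python
def hidden_layer_outputs(layers, inputs):
    """Run inputs through the layers in order, keeping every layer's output."""
    outputs = []
    activation = inputs
    for layer in layers:
        activation = layer(activation)
        outputs.append(activation)
    return outputs


def extract_features(layers, inputs, layer_index):
    """Return the output of the hidden layer with the given 1-based index."""
    if not 1 <= layer_index <= len(layers):
        raise IndexError("layer_index must be between 1 and the number of layers")
    return hidden_layer_outputs(layers, inputs)[layer_index - 1]
```

For a model with N layers, `extract_features(layers, x, N - 1)` corresponds to using the (N-1)-th hidden layer at the first feature extraction.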

在将特征文本作为新的训练数据后，再依据所述训练数据对分类模型执行训练。由于需要对场景类别进一步挖掘，在依据第一聚类模型对训练数据进行第一聚类分析时，调整所述第一聚类模型输出的类簇数量为m2，其中，所述类簇数量m2大于场景类别数量n1。经过第一聚类模型的第一聚类分析之后，确定出m2个训练类簇。在所述展示页面对m2个训练类簇进行展示，从而能够基于对筛选控件的触发，从m2训练类簇中筛选出场景类别明确的类簇，从而能够依据对训练类簇进行场景类别的标记，此时得到场景类别数量n2，并且，场景类别数量n2大于场景类别数量n1。例如在销售场景和售后场景的基础上，对场景类别进一步划分为：产品咨询场景、同意购买场景、同意售后场景、拒绝换货场景等。After the feature text is used as new training data, the classification model is trained according to the training data. Since the scene category needs to be further mined, when the first clustering analysis is performed on the training data according to the first clustering model, the number of clusters output by the first clustering model is adjusted to m2, wherein the number of clusters m2 greater than the number of scene categories n1. After the first cluster analysis of the first cluster model, m2 training clusters are determined. The m2 training clusters are displayed on the display page, so that based on the triggering of the screening control, clusters with clear scene categories can be selected from the m2 training clusters, so that the scene categories can be marked according to the training clusters. , the number n2 of scene categories is obtained at this time, and the number n2 of scene categories is greater than the number n1 of scene categories. For example, on the basis of sales scenarios and after-sales scenarios, the scenario categories are further divided into: product consultation scenarios, purchase consent scenarios, after-sales scenarios consent, and refusal to exchange items, etc.

通过对每个训练类簇进行场景类别的标注，从而使得进行类别标记的训练类簇中文本分别形成带有标注的文本数据。然后将标注的文本数据对阶段训练完成的分类模型M1进行继续训练，得到阶段训练完成的分类模型M2，其可以分类出n2个场景类别，从而使得阶段训练完成的分类模型M2相比于阶段训练完成的分类模型M1，增加了场景类别数量。By labeling the scene category for each training cluster, the texts in the training clusters that are labelled respectively form the labeled text data. Then, the labeled text data is used to continue training the classification model M1 completed in the stage training, and the classification model M2 completed in the stage training is obtained, which can classify n2 scene categories, so that the classification model M2 completed in the stage training is compared with the stage training. The completed classification model M1 increases the number of scene categories.

Difference information is then determined from the training data used to train the stage-trained classification model M2, and the difference information is used to judge whether M2 meets the scene mining condition. If M2 meets the scene mining condition, step S206 is performed; if it does not, step S208 is performed. For example, when M2 is determined to meet the scene mining condition, the scene categories need to be refined further: the product consultation scene may be subdivided into a price inquiry scene, a size inquiry scene, a style inquiry scene, and so on. The process repeats in this way until the trained classification model is finally obtained. Applied to text mining, the trained classification model can classify a customer's dialogue text precisely, and a well-matched target reply text can be determined from the resulting scene category, which helps raise the service level of business personnel and improves the user experience.

S208. Use the stage-trained classification model as the trained classification model.

In this embodiment of the present invention, when the difference information is determined not to meet the scene mining condition, the classification effect is already clear, so the stage-trained classification model is used directly as the trained classification model.
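The alternation between continued training (step S206) and termination (step S208) can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the implementation of the invention: the function name `staged_training`, the three callables, the `cluster_step` increment, and the `max_stages` safety cap are all hypothetical names introduced for illustration. The source only requires that the cluster count grow between stages (m2 greater than n1).

```python
from typing import Callable, List, Tuple


def staged_training(
    training_data: List[str],
    train_stage: Callable[[List[str], int], Tuple[object, List[str]]],
    meets_mining_condition: Callable[[List[str]], bool],
    update_training_data: Callable[[object, List[str]], List[str]],
    initial_clusters: int = 2,
    cluster_step: int = 2,
    max_stages: int = 5,
) -> Tuple[object, int]:
    """Repeat stage training until no further scene mining is needed.

    All parameter names are illustrative assumptions. `train_stage` is
    expected to cluster, label and train, returning
    (stage_trained_model, labeled_texts).
    """
    if max_stages < 1:
        raise ValueError("max_stages must be at least 1")
    clusters = initial_clusters
    model = None
    for stage in range(1, max_stages + 1):
        model, labeled = train_stage(training_data, clusters)
        if not meets_mining_condition(labeled):
            # S208: the stage-trained model becomes the trained model.
            return model, stage
        # S206: derive new training data and ask for more clusters next time.
        training_data = update_training_data(model, labeled)
        clusters += cluster_step
    # Assumed safety cap: stop after max_stages even if mining could continue.
    return model, max_stages
```

In practice `meets_mining_condition` could wrap the operator decision collected on the display page, and `update_training_data` could wrap the feature extraction performed by the stage-trained model.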

Referring to FIG. 3, a flowchart of the steps of a text mining method provided by an embodiment of the present invention is shown. The method may include:

S301. Receive dialogue information, and obtain the dialogue text of a first user from the dialogue information.

S302. Input the dialogue text into a classification model for classification and recognition, and determine the corresponding target scene category. The classification model is obtained as follows: training is performed on training data to obtain a stage-trained classification model; difference information is determined based on the training data; whether the stage-trained classification model meets the scene mining condition is judged; and, according to the result of that judgment, it is determined whether to update the training data and continue training the stage-trained classification model.

S303. Query the target reply text corresponding to the target scene category.

S304. Use the target reply text as the dialogue text of a second user, in reply to the dialogue text of the first user.

In this embodiment of the present invention, the classification model is obtained by the classification model training method described above. After the dialogue information is received, the dialogue text of the first user is obtained from it and input into the classification model for classification and recognition, yielding the target scene category corresponding to the first user's dialogue text. The historical dialogue texts of the second user are then obtained from the dialogue records. The scene categories of these historical dialogue texts have already been labeled, and their number and types are consistent with the scene categories recognized by the classification model.

Therefore, once the target scene category of the first user's dialogue text has been determined, the second user's historical dialogue texts under that target scene category are queried from the database and determined to be the target reply text, which is then used as the second user's dialogue text in reply to the first user's dialogue text. There may be multiple target reply texts, and they may be displayed as at least one cluster. In one example, when there are multiple target reply texts, one of them can be selected as the second user's dialogue text by clicking the selection control on the display page, and fed back to the first user. In this way, even a junior business person can mine, from the second user's historical dialogue texts, the wording best suited to answering the first user's dialogue text. This makes learning easier for business personnel and raises the service level, while also improving the first user's experience.
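Steps S302 and S303 amount to classifying a dialogue text and then looking up historical replies stored under the resulting scene category. The following is a minimal sketch, assuming an in-memory index in place of the database; `ReplyIndex`, `suggest_replies` and the `classify` callable are hypothetical names, and any trained classification model could be passed in as `classify`.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class ReplyIndex:
    """Maps a scene category to the second user's labeled historical texts."""

    def __init__(self) -> None:
        self._by_scene: Dict[str, List[str]] = defaultdict(list)

    def add(self, scene: str, reply: str) -> None:
        self._by_scene[scene].append(reply)

    def lookup(self, scene: str) -> List[str]:
        # Return a copy so callers cannot mutate the index by accident.
        return list(self._by_scene.get(scene, []))


def suggest_replies(
    dialogue_text: str,
    classify: Callable[[str], str],
    index: ReplyIndex,
) -> Tuple[str, List[str]]:
    """Classify the first user's text (S302), then query replies (S303)."""
    scene = classify(dialogue_text)
    return scene, index.lookup(scene)
```

The caller would then present the returned candidates and let the second user pick one to send (S304).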

In another example, if there are many target reply texts under the same target scene category, a preset second clustering model can be used to perform a second cluster analysis on them, so that distinctive ways of replying are easier to find. The number of clusters output by the second clustering model changes dynamically with the semantic information of the target reply texts. The second clustering model may use the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. The second cluster analysis therefore needs no preset number of clusters, and it yields several groups of target reply texts that reply in different ways. One of the target reply texts can then be selected as the second user's dialogue text by clicking the selection control on the display page, and fed back to the first user.
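To illustrate why DBSCAN needs no preset cluster count, the following is a minimal pure-Python sketch of the algorithm over numeric vectors. It assumes the target reply texts have already been converted to vectors (for example TF-IDF values); the parameter names `eps` and `min_samples` and the use of Euclidean distance are choices made for this illustration, not details given in the source.

```python
import math
from typing import List, Optional, Sequence

NOISE = -1  # label for points that belong to no cluster


def dbscan(points: Sequence[Sequence[float]], eps: float, min_samples: int) -> List[int]:
    """Return one cluster label per point; the cluster count is not preset."""
    n = len(points)

    def neighbors(i: int) -> List[int]:
        # A point counts as its own neighbor, as in the usual formulation.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels: List[Optional[int]] = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # core point: keep growing the cluster
    return [label if label is not None else NOISE for label in labels]
```

Each resulting cluster would correspond to one way of replying, and points labeled as noise are replies unlike any other, which may be exactly the distinctive replies worth reviewing.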

In one example, referring to FIG. 4 and FIG. 5, the first client of the first user and the second client of the second user each provide a display page that displays the dialogue messages. On the display page of the second user, a classification control (such as the circle in the middle of the page) is provided for each dialogue text of the first user. When a classification control is triggered, a classification instruction is generated and sent to the server. The dialogue text corresponding to that classification control is input into the classification model for classification and recognition; after the target scene category and then the target reply text have been determined, the target reply text is sent to the second client and displayed on the display page of the second user. Referring to FIG. 6, each target reply text is paired with a reply selection control (such as the arrow in the middle part of the page). When a reply selection control is triggered, the corresponding target reply text is sent to the first client as the second user's dialogue text.

In another example, the first client and the second client may be the same client. Referring to FIG. 7, the user can enter a dialogue text on the display page of the client; this text is treated by default as the first user's dialogue text. When the classification control on the display page is triggered, the first user's dialogue text is input into the classification model for classification and recognition, yielding the corresponding target scene category. The historical dialogue texts of the second user are then obtained from the dialogue records. The scene categories of these historical dialogue texts have already been labeled, and their number and types are consistent with the scene categories recognized by the classification model.

Therefore, once the target scene category of the first user's dialogue text has been determined, the second user's historical dialogue texts under that target scene category are queried from the database and determined to be the target reply text, which is used as the second user's dialogue text in reply to the first user's dialogue text and displayed in the user's client, as shown in FIG. 8. In a business training scenario for new employees, a new employee can thus compose the first user's dialogue text freely and learn different wording from the target reply texts returned for different inputs. These target reply texts can serve as material for training new employees or for experienced employees to learn from one another.

It should be noted that, for simplicity of description, the method embodiments are expressed as a series of combined actions. Those skilled in the art should understand, however, that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application certain steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions involved are not necessarily required by the embodiments of the present application.

Referring to FIG. 9, a classification model training apparatus provided by an embodiment of the present invention is shown. The apparatus may include:

A data acquisition module 901, configured to obtain a plurality of dialogue texts as training data.

A model stage training module 902, configured to train the classification model according to the training data to obtain a stage-trained classification model.

A condition judgment module 903, configured to determine difference information based on the training data, and to judge, according to the difference information, whether the stage-trained classification model meets the scene mining condition.

A first training module 904, configured to, if the difference information meets the scene mining condition, update the training data according to the training data and the stage-trained classification model, and continue training.

A second training module 905, configured to, if the difference information does not meet the scene mining condition, use the stage-trained classification model as the trained classification model.

Wherein, the model stage training module is further configured to:

In an optional embodiment of the invention, determining difference information based on the training data includes:

Analyzing the texts of the training clusters corresponding to different scene categories to determine the difference information.

In an optional embodiment of the invention, the first training module may include:

A feature extraction submodule, configured to input the labeled text data into the stage-trained classification model and perform feature extraction to obtain the corresponding feature text.

A data update submodule, configured to use the feature text as the training data.

In an optional embodiment of the invention, the model stage training module is further configured to:

Input the training data into a first clustering model for a first cluster analysis and output a plurality of training clusters, where the number of training clusters is preset in the first clustering model.
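A clustering model with a preset number of clusters can be illustrated with k-means. This section of the source does not name the algorithm used by the first clustering model, so k-means is an assumption chosen here only because its cluster count `k` is fixed in advance; the function name `kmeans` and the deterministic choice of the first `k` points as initial centroids are likewise assumptions made for the sketch.

```python
import math
from typing import List, Sequence, Tuple


def kmeans(
    points: Sequence[Sequence[float]], k: int, iterations: int = 20
) -> Tuple[List[int], List[Tuple[float, ...]]]:
    """Assign each point to one of k clusters; k is fixed before clustering."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumed deterministic initialisation: the first k points.
    centroids = [tuple(float(x) for x in p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids
```

Raising `k` from one training stage to the next corresponds to adjusting the cluster count from n1 to a larger m2.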

In an optional embodiment of the invention, the model stage training module may include:

A word segmentation submodule, configured to segment the training data into words and determine the corresponding text tokens.

A token conversion submodule, configured to convert the text tokens into text word vectors or text TF-IDF values.

A cluster output submodule, configured to input the text word vectors or text TF-IDF values into the first clustering model for the first cluster analysis and output several training clusters.
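The conversion of text tokens into TF-IDF values can be sketched as follows. This is a minimal sketch assuming already segmented documents (for Chinese text a dedicated word segmenter would be needed first) and the plain unsmoothed weighting tf × log(N / df); the source does not state which TF-IDF variant is used, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse {token: weight} vector per segmented document."""
    n = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return vectors
```

Tokens that occur in every document receive a weight of zero, so the clustering step is driven by the words that actually distinguish one dialogue text from another.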

In an optional embodiment of the invention, the cluster output submodule may further include:

A recognition unit, configured to perform named entity recognition on the text tokens and determine the target named entities among them, the target named entities including at least one of the following: a person name, an organization name, and a place name.

A replacement unit, configured to replace each target named entity with a target keyword and to convert the text tokens obtained after replacement into text TF-IDF values.
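The effect of replacing named entities before computing TF-IDF values can be shown with a small sketch. A dictionary lookup stands in for a real named entity recognition model here; the table `ENTITY_TYPES`, the placeholder keywords in `PLACEHOLDERS`, and the function `replace_entities` are all hypothetical and serve only to show that texts differing solely in names become identical after replacement.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a named entity recognition model.
ENTITY_TYPES: Dict[str, str] = {
    "Alice": "PERSON",
    "Bob": "PERSON",
    "Acme": "ORG",
    "Initech": "ORG",
    "Beijing": "PLACE",
}

# Hypothetical target keywords, one per entity type.
PLACEHOLDERS: Dict[str, str] = {
    "PERSON": "<person>",
    "ORG": "<org>",
    "PLACE": "<place>",
}


def replace_entities(
    tokens: List[str], entity_types: Optional[Dict[str, str]] = None
) -> List[str]:
    """Swap every recognized entity token for its type's target keyword."""
    table = ENTITY_TYPES if entity_types is None else entity_types
    return [PLACEHOLDERS[table[t]] if t in table else t for t in tokens]
```

After replacement, specific names no longer inflate the TF-IDF weights, so clustering groups texts by what is being asked rather than by who or where.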

In an optional embodiment of the invention, the apparatus may further include:

An information display module, configured to provide a display page that displays the difference information, and to obtain mining operation information based on the display page.

A condition determination module, configured to determine, according to the mining operation information, whether the stage-trained classification model meets the scene mining condition.

Referring to FIG. 10, a text mining apparatus provided by an embodiment of the present invention is shown. The apparatus may include:

A dialogue receiving module 1001, configured to receive dialogue information and obtain the dialogue text of a first user from the dialogue information.

A scene recognition module 1002, configured to input the dialogue text into a classification model for classification and recognition, and to determine the corresponding target scene category. The classification model is obtained as follows: training is performed on training data to obtain a stage-trained classification model; difference information is determined based on the training data; whether the stage-trained classification model meets the scene mining condition is judged; and, according to the result of that judgment, it is determined whether to update the training data and continue training the stage-trained classification model.

A text query module 1003, configured to query the target reply text corresponding to the target scene category.

A text feedback module 1004, configured to use the target reply text as the dialogue text of a second user, in reply to the dialogue text of the first user.

The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be understood by reference to one another.

Those skilled in the art will readily appreciate that the above embodiments can be applied in any combination, so any combination of the above embodiments is an implementation of the present invention. For reasons of space, the combinations are not described one by one in this specification.

Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail, so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the claims reflect, inventive aspects lie in fewer than all features of a single previously disclosed embodiment. The claims following the Detailed Description are therefore expressly incorporated into the Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules of the device in an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of the embodiments may be combined into one module, unit or component, and may also be divided into multiple submodules, subunits or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.

An electronic device, comprising:

one or more processors;

a memory;

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method described in the above embodiments.

A computer-readable storage medium storing a computer program for use in conjunction with an electronic device, the computer program being executable by a processor to carry out the method described in the above embodiments.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Accordingly, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing terminal equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal equipment create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal equipment to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment, so that a series of operational steps is performed on the computer or other programmable terminal equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art may make additional changes and modifications to these embodiments once they grasp the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.

Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or terminal device. Without further limitation, an element defined by the phrase "comprising a..." does not exclude the presence of additional identical elements in the process, method, article or terminal device that includes the element.

The classification model training method, text mining method, classification model training apparatus and text mining apparatus provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims

1. A classification model training method, characterized in that the method comprises:

obtaining a plurality of dialogue texts as training data;

training a classification model according to the training data to obtain a stage-trained classification model;

determining difference information based on the training data, and judging, according to the difference information, whether the stage-trained classification model meets a scene mining condition;

if the difference information meets the scene mining condition, updating the training data according to the training data and the stage-trained classification model, and continuing training;

if the difference information does not meet the scene mining condition, using the stage-trained classification model as the trained classification model;

wherein training the classification model according to the training data to obtain the stage-trained classification model comprises:

performing cluster analysis on the training data to determine a plurality of training clusters;

determining the scene category corresponding to each training cluster, and labeling the texts in the training cluster with that scene category;

training the classification model with the labeled text data to obtain the stage-trained classification model.

2. The classification model training method according to claim 1, wherein determining difference information based on the training data comprises:

analyzing the texts of the training clusters corresponding to different scene categories to determine the difference information.

3. The classification model training method according to claim 1, wherein updating the training data according to the training data and the stage-trained classification model comprises:

inputting the labeled text data into the stage-trained classification model and performing feature extraction to obtain the corresponding feature text;

using the feature text as the training data.

4. The classification model training method according to claim 1, wherein performing cluster analysis on the training data to determine a plurality of training clusters comprises:

inputting the training data into a first clustering model for a first cluster analysis and outputting a plurality of training clusters, wherein the number of training clusters is preset in the first clustering model.

5. The classification model training method according to claim 4, wherein inputting the training data into the first clustering model for the first cluster analysis and outputting a plurality of training clusters comprises:

segmenting the training data into words to determine the corresponding text tokens;

converting the text tokens into text word vectors or text TF-IDF values;

inputting the text word vectors or text TF-IDF values into the first clustering model for the first cluster analysis, and outputting several training clusters.

6. The classification model training method according to claim 5, wherein converting the text tokens into text TF-IDF values comprises:

performing named entity recognition on the text tokens to determine the target named entities among them, the target named entities including at least one of the following: a person name, an organization name, and a place name;

replacing each target named entity with a target keyword, and converting the text tokens obtained after replacement into text TF-IDF values.

7. The classification model training method according to claim 2, wherein the method further comprises:

providing a display page to display the difference information, and obtaining mining operation information based on the display page;

determining, according to the mining operation information, whether the stage-trained classification model meets the scene mining condition.

8. A text mining method, wherein the method comprises:

receiving dialogue information, and obtaining the dialogue text of a first user from the dialogue information;

inputting the dialogue text into a classification model for classification and recognition, and determining the corresponding target scene category, wherein the classification model is obtained by performing training on training data to obtain a stage-trained classification model, determining difference information based on the training data, judging whether the stage-trained classification model meets a scene mining condition, and determining, according to the result of that judgment, whether to update the training data and continue training the stage-trained classification model;

querying the target reply text corresponding to the target scene category;

using the target reply text as the dialogue text of a second user, in reply to the dialogue text of the first user.

9. An electronic device, comprising:

one or more processors;

a memory;

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method of any one of claims 1-8.

10. A computer-readable storage medium storing a computer program for use in conjunction with an electronic device, the computer program being executable by a processor to perform the method of any one of claims 1-8.