
CN115098680B - Data processing method, device, electronic equipment, medium and program product - Google Patents


Info

Publication number
CN115098680B
CN115098680B (application number CN202210756062.XA)
Authority
CN
China
Prior art keywords
text data
sample
target
classification model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210756062.XA
Other languages
Chinese (zh)
Other versions
CN115098680A (en)
Inventor
蔡浩锐
李健
蔡超维
李利强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210756062.XA
Publication of CN115098680A
Application granted
Publication of CN115098680B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a data processing method, apparatus, electronic device, medium and program product, applied in the field of computer technology. The method comprises the following steps: obtaining sample text data; sequentially performing sliding segmentation on its N sub-texts with a sliding window to obtain a plurality of candidate text data; pre-training a classification model on the sample text data to obtain a pre-trained classification model; invoking the pre-trained classification model to perform classification prediction on the plurality of candidate text data, and selecting target text data from the candidates according to the prediction accuracy of the pre-trained classification model on them; and training the pre-trained classification model on the sample text data together with the target text data to obtain a trained classification model. With the embodiment of the application, the accuracy of the trained classification model can be improved, so that text data can be classified more accurately by the trained model.

Description

Data processing method, device, electronic equipment, medium and program product
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a data processing method, apparatus, electronic device, medium, and program product.
Background
With the continuous development of computer technology, artificial intelligence (AI) technology is maturing; among its related techniques is machine learning.
In the prior art, a model can be trained using machine learning techniques, and the trained model can then be applied to classification prediction of text data in a given business scenario. For example, session text data of a target object is obtained, and the model classifies the risk of that session text data, such as whether it contains malicious text. When the training set is actually constructed, however, only a small number of samples may be available in the current business scenario. If the model is trained on such a small-sample training set, it is prone to overfitting, and the accuracy of the trained model is low.
Disclosure of Invention
The embodiment of the application provides a data processing method, apparatus, electronic device, medium and program product, which can improve the accuracy of a trained classification model, so that text data can be classified more accurately by the trained model.
In one aspect, an embodiment of the present application provides a data processing method, where the method includes:
acquiring sample text data, where the number of sample text data items is smaller than a sample index number, the sample text data carries a classification label, and the sample text data comprises N sub-texts, N being a positive integer;
sequentially performing sliding segmentation on the N sub-texts based on a sliding window to obtain a plurality of candidate text data of the sample text data, where any candidate text data comprises at least one consecutive sub-text of the N sub-texts;
pre-training a classification model based on the sample text data to obtain a pre-trained classification model;
invoking the pre-trained classification model to perform classification prediction on the plurality of candidate text data, and selecting target text data from the plurality of candidate text data according to the prediction accuracy of the pre-trained classification model on the plurality of candidate text data;
training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model, where the trained classification model is used to perform classification prediction on text data.
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
an acquisition module, configured to acquire sample text data, where the number of sample text data items is smaller than a sample index number, the sample text data carries a classification label, and the sample text data comprises N sub-texts, N being a positive integer;
a processing module, configured to sequentially perform sliding segmentation on the N sub-texts based on a sliding window to obtain a plurality of candidate text data of the sample text data, where any candidate text data comprises at least one consecutive sub-text of the N sub-texts;
the processing module is further configured to pre-train a classification model based on the sample text data to obtain a pre-trained classification model;
the processing module is further configured to invoke the pre-trained classification model to perform classification prediction on the plurality of candidate text data, and to select target text data from the plurality of candidate text data according to the prediction accuracy of the pre-trained classification model on the plurality of candidate text data;
the processing module is further configured to train the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model, where the trained classification model is used to perform classification prediction on text data.
In one aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory is configured to store a computer program, the computer program including program instructions, and the processor is configured to invoke the program instructions to perform some or all of the steps in the above method.
In one aspect, embodiments of the present application provide a computer readable storage medium storing a computer program comprising program instructions for performing part or all of the steps of the above method when executed by a processor.
According to one aspect of the present application, there is provided a computer program product or computer program comprising program instructions stored in a computer-readable storage medium. A processor of a computer device reads the program instructions from the computer-readable storage medium and executes them, causing the computer device to perform the data processing method provided above.
According to the embodiment of the application, sample text data can be obtained, and its N sub-texts are sequentially slide-segmented with a sliding window to obtain a plurality of candidate text data. Because the candidate text data are obtained by expanding the sample text data, the number of samples increases, which mitigates the overfitting that a small-sample training set may cause. A classification model is pre-trained on the sample text data to obtain a pre-trained classification model; the pre-trained model is then invoked to classify the candidate text data, and target text data are selected from the candidates according to the model's prediction accuracy on them, so that higher-quality target text data are chosen. Finally, the pre-trained classification model is trained on the sample text data and the target text data to obtain a trained classification model. The high-quality target text data and the sample text data together benefit model training, so the prediction accuracy of the trained classification model is improved, and text data can be classified more accurately by the trained model.
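The five claimed steps can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: the toy keyword classifier, the non-overlapping window stride, the labels `risk`/`safe`, and the fixed 0.95 confidence are all assumptions, since the claims do not fix a model architecture.

```python
class ToyClassifier:
    """Stand-in for the classification model: flags text as "risk" when it
    contains any keyword seen in a risk-labeled training sample. The fixed
    0.95 confidence is purely illustrative."""

    def __init__(self):
        self.keywords = set()

    def fit(self, samples):  # samples: list of (text, label) pairs
        for text, label in samples:
            if label == "risk":
                self.keywords.update(text.split())

    def predict(self, text):
        hit = any(word in self.keywords for word in text.split())
        return ("risk" if hit else "safe"), 0.95


def train_pipeline(samples, window=2, threshold=0.9):
    """samples: list of (list_of_subtexts, label) pairs."""
    # Steps 1-2: slide a window of `window` sub-texts over each sample's
    # sub-texts to expand the sample set into candidate text data.
    candidates = []
    for subtexts, label in samples:
        for i in range(0, len(subtexts), window):
            candidates.append((" ".join(subtexts[i:i + window]), label))
    # Step 3: pre-train the classification model on the original samples.
    model = ToyClassifier()
    model.fit([(" ".join(s), l) for s, l in samples])
    # Step 4: keep only candidates that the pre-trained model classifies
    # correctly with probability above the threshold (target text data).
    targets = []
    for text, label in candidates:
        pred, prob = model.predict(text)
        if pred == label and prob > threshold:
            targets.append((text, label))
    # Step 5: train again on the samples plus the selected targets.
    model.fit([(" ".join(s), l) for s, l in samples] + targets)
    return model, targets
```

The real method would substitute a trained neural or statistical classifier for `ToyClassifier`; the control flow of the five steps is what the sketch is meant to show.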
Drawings
In order to describe the technical solutions of the embodiments of the present application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. It is evident that the drawings described below are only some embodiments of the present application; a person skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an application architecture according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
fig. 4a is a schematic view of a scenario for obtaining candidate text data according to an embodiment of the present application;
fig. 4b is a schematic view of a scenario for obtaining candidate text data according to an embodiment of the present application;
fig. 4c is a schematic view of a scenario for obtaining candidate text data according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of training a classification model according to an embodiment of the present application;
FIG. 6a is a schematic flow chart of an application classification model according to an embodiment of the present application;
FIG. 6b is a schematic flow chart of a feature engineering process according to an embodiment of the present application;
fig. 7 is a schematic diagram of a risk early warning scenario based on a classification model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings.
The data processing method provided by the embodiment of the application is executed on an electronic device, which may be a server or a terminal. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, smart voice interaction device, smart home appliance, vehicle-mounted terminal, aircraft, and the like. The embodiment of the application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation, assisted driving, and the like.
Next, technical terms related to the technical fields to which the scheme of the embodiment of the present application may be applied are described:
1. Artificial intelligence:
The embodiment of the application relates to machine learning (ML) in artificial intelligence. Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. The classification model in the technical scheme of the application can be trained based on machine learning techniques.
In some embodiments, please refer to fig. 1, which is a schematic diagram of an application architecture according to an embodiment of the present application; the data processing method of the present application may be executed through this architecture. As shown in fig. 1, the architecture may include an electronic device in which a classification model to be trained is deployed. The electronic device may acquire a sample set that includes sample text data, each comprising N sub-texts. It pre-trains the classification model with the sample text data, and segments the sample text data into a plurality of candidate text data to expand the sample set; the segmentation may be implemented based on a sliding window. It then invokes the pre-trained classification model to classify the candidate text data, selects target text data from the candidates according to the prediction results (which characterize the model's prediction accuracy on the candidates), and trains the pre-trained classification model on the sample text data and the target text data to obtain a trained classification model.
It should be understood that fig. 1 is merely exemplary to represent possible application architectures of the technical solution of the present application, and is not limited to specific architectures of the technical solution of the present application, that is, the technical solution of the present application may also provide other application architectures.
Optionally, in some embodiments, the electronic device may execute the data processing method according to actual business requirements to improve the prediction accuracy of the resulting model. The technical scheme of the application can be applied to classification scenarios of any text data. For example, the text data may be session text data of a target object, and the classification may be a risk classification, such as whether the target object has been exposed to malicious information (also called malicious text); the classification result may indicate that the target object has or has not been exposed to malicious information. It will be appreciated that when the classification result indicates exposure to malicious information, a session risk exists; when it indicates no exposure, there is no session risk. The electronic device can acquire sample session text data carrying risk classification labels, expand the samples according to the method provided by the technical scheme of the application, and train on the sample session text data together with the obtained candidate session text data to obtain a classification model capable of risk classification.
As another example, the text data may be social text data of a target object, and the classification may be an emotion classification, such as a classification of the emotional tendency of the target object (also called emotional-text classification); the classification result may be, for example, positive or negative. The electronic device can acquire sample social text data carrying emotion classification labels, expand the samples according to the method provided by the technical scheme of the application, and train on the sample social text data together with the obtained candidate social text data to obtain a classification model capable of emotion classification.
Optionally, the data related to the present application, such as sample text data, candidate text data, etc., may be stored in a database, or may be stored in a blockchain, such as by a blockchain distributed system, and the present application is not limited thereto.
It should be noted that the specific embodiments of the present application involve data related to users, such as the user data (session data, social data, and the like) that must be acquired when constructing a sample set or when applying the model in practice. When the above embodiments are applied to specific products or technologies, user permission or consent must be obtained, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
It can be understood that the above scenario is merely an example, and does not constitute a limitation on the application scenario of the technical solution provided by the embodiment of the present application, and the technical solution of the present application may also be applied to other scenarios. For example, as one of ordinary skill in the art can know, with the evolution of the system architecture and the appearance of new service scenarios, the technical solution provided by the embodiment of the present application is also applicable to similar technical problems.
The scheme provided by the embodiment of the application involves artificial intelligence technologies such as machine learning, and is described in detail through the following embodiments:
based on the above description, the embodiments of the present application provide a data processing method, which may be performed by the above-mentioned electronic device. Referring to fig. 2, fig. 2 is a flow chart of a data processing method according to an embodiment of the application. As shown in fig. 2, the flow of the data processing method according to the embodiment of the present application may include the following:
s201, acquiring sample text data.
In some embodiments, the number of sample text data items is smaller than a sample index number. The sample index number may be set manually and is used to distinguish small-sample training sets from non-small-sample training sets: when the number of samples in a sample set is smaller than the sample index number, the sample set is a small-sample training set. That is, the acquired sample text data belong to a small-sample training set. The classification model can therefore be trained with the sample text data to realize model training in a small-sample scenario, and the training method provided by the embodiment of the application improves the training effect and prediction accuracy in such a scenario.
In some embodiments, the sample text data may be text of any business type. For example, it may be session text data (such as text obtained by transcribing a conversation recorded when an outbound robot makes an intelligent outbound call to a sample object such as a user, with the robot-side and user-side utterances marked separately), or social text data (such as text obtained by combining comment information posted by a sample object on a social application); the specific type of the sample text data is not limited here. The sample text data may be text in any language, for example Chinese, English, or a mixture of Chinese and English; the application does not limit the form of the text. The sample text data may include at least one sentence. A sentence is a basic unit of language use, composed of words and phrases, that expresses a complete meaning; its end is generally marked with an identifier such as a period, question mark, ellipsis, or exclamation mark.
In some embodiments, the sample text data carries a classification label. Classification labels can be set by relevant business personnel according to the sample text data and the actual application scenario. For example, if the sample text data is sample session text data and the application scenario is risk prediction for session text data, the classification label may indicate a specific risk class, such as exposed to malicious information or not exposed to malicious information. The trained classification model thus has different classification functions depending on the classification labels.
In some embodiments, each sample text data may contain one or more sub-texts, and the sub-texts of each sample text data are divided in the same way; one sample text data is described here as an example. Assume the sample text data comprises N sub-texts, N being a positive integer. The electronic device may divide the sample text data according to a preset division rule to obtain the N sub-texts. The preset division rule may be to divide the sample text data at the separator characters it contains; the separator characters may be set by relevant business personnel according to experience, for example identifiers such as commas, periods, and exclamation marks. A sub-text obtained this way may be a complete sentence of the sample text data, or part of a sentence. Alternatively, the preset division rule may divide the sample text data by a specified text length, for example every 10 characters form one sub-text; specified characters in the sample text data, such as specified identifiers, may or may not be filtered out before division. The preset division rule is not limited here.
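As a concrete sketch of the two division rules above (separator-based and fixed-length), assuming for illustration that the separators are comma, period, and exclamation mark in both ASCII and full-width Chinese forms:

```python
import re

# Separator characters, as one example of the identifiers business
# personnel might choose (comma, period, exclamation mark, in ASCII
# and full-width Chinese forms).
SEPARATORS = r"[,.!，。！]"

def split_by_separators(sample_text):
    """Rule 1: divide the sample text data at separator characters."""
    parts = re.split(SEPARATORS, sample_text)
    return [p.strip() for p in parts if p.strip()]

def split_by_length(sample_text, length=10):
    """Rule 2: divide the sample text data every `length` characters
    (the 10-character example from the text)."""
    return [sample_text[i:i + length]
            for i in range(0, len(sample_text), length)]
```

Either function yields the N sub-texts that the sliding segmentation in S202 then operates on.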
S202, sequentially carrying out sliding segmentation on the N sub-texts based on the sliding window to obtain a plurality of candidate text data of the sample text data.
In some embodiments, the electronic device may segment the sample text data to obtain a plurality of candidate text data for it. The segmentation may be implemented based on a sliding window; for example, the sub-texts contained in the sample text data may be slide-segmented with a sliding window.
In some embodiments, the process and principle of obtaining candidate text data are the same for each sample text data, and one sample text data is described here as an example. Assume it comprises N sub-texts. Specifically, a sliding window is obtained whose window size is M sub-texts, M being a positive integer, and the N sub-texts are sequentially slide-segmented based on the sliding window to obtain a plurality of candidate text data. When the last candidate text data is segmented, if fewer than M sub-texts remain, the remaining sub-texts may be used directly as a candidate text data; alternatively, a default sub-text may be set to pad the remaining sub-texts so that their number equals M, and the padded sub-texts are used as a candidate text data.
The sliding window may have a fixed or a variable size; that is, the value of M may be fixed or may change at each slide. For example, the window size may be 3 sub-texts at every slide, or it may take the sizes of 1, 3, and 5 sub-texts in turn. The rule defining the sliding window size is not limited here and may be set by relevant business personnel. Consequently, any candidate text data may contain any consecutive at least one of the N sub-texts. It will be appreciated that the length of the sliding window may differ between divisions; it is determined by the size of the M sub-texts, which in turn depends on the number of words or characters they contain.
In this way, the sub-texts within one sliding window yield one candidate text data, and different candidate text data carry different information than the corresponding sample text data. Sample expansion is thus achieved with the plurality of candidate text data, which mitigates the overfitting caused by training the model with small-sample data.
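The sliding segmentation of S202 can be sketched as follows, including both of the remainder-handling options described above. The stride (here, the window size itself, giving non-overlapping segments) is an assumption, since the text does not fix how far the window advances at each slide.

```python
def sliding_segment(subtexts, m, pad=None):
    """Slide a window of M sub-texts over the N sub-texts in order and
    return the candidate text data. If fewer than M sub-texts remain at
    the end, they are kept as-is when `pad` is None; otherwise they are
    padded with the default sub-text `pad` up to length M."""
    candidates = []
    for start in range(0, len(subtexts), m):
        window = subtexts[start:start + m]
        if len(window) < m and pad is not None:
            window = window + [pad] * (m - len(window))
        candidates.append(window)
    return candidates
```

A variable-size window can be emulated by calling `sliding_segment` once per window size (for example M = 1, 3, 5) and concatenating the results.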
S203, pre-training the classification model based on the sample text data to obtain a pre-trained classification model.
In some embodiments, the electronic device pre-trains the classification model based on the sample text data by training it with the sub-texts contained in the sample text data. Specifically, pre-training based on the sub-texts may proceed as follows: acquire sample features of the sample text data based on its sub-texts; invoke the classification model to output a classification result for the sample text data based on the sample features; generate a prediction bias based on the classification result and the classification label of the sample text data; and correct the model parameters of the classification model based on the prediction bias to obtain a pre-trained classification model. The sample features may include one or more feature types, such as features not produced by the classification model and features produced by the classification model. How the sample features are obtained may be set by relevant business personnel according to experience and is not limited here. The sample feature types used in pre-training may be the same as those used when training the pre-trained classification model; the specific manner of obtaining sample features is described in the embodiments below.
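The feature, prediction, bias, and parameter-correction loop described above can be sketched with a minimal logistic-regression stand-in. The patent does not fix the model architecture or the feature engineering, so the logistic model, learning rate, and `featurize` function here are all assumptions.

```python
import math

def pretrain(samples, featurize, epochs=200, lr=0.5):
    """Pre-train a toy binary classifier: extract sample features, output
    a prediction, compute the prediction bias against the classification
    label (0 or 1), and correct the parameters accordingly."""
    dim = len(featurize(samples[0][0]))
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = featurize(text)                                 # sample features
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))                      # predicted probability
            err = p - label                                     # prediction bias
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias
```

With even a one-dimensional keyword-presence feature, the loop separates a positive sample from a negative one after a few hundred updates, illustrating how the prediction bias drives the parameter correction.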
S204, invoking a pre-trained classification model to conduct classification prediction on the plurality of candidate text data, and selecting target text data from the plurality of candidate text data according to prediction accuracy of the pre-trained classification model on the plurality of candidate text data.
In some embodiments, because the amount of information in the candidate text data varies, their sample quality for model training also differs. The electronic device therefore invokes the pre-trained classification model to classify the plurality of candidate text data, obtains a prediction result for each candidate, and determines the target text data based on the prediction accuracy characterized by those results. The target text data are samples of higher quality, so training the model with them yields a better training effect. Moreover, the model can be trained in a semi-supervised manner, which improves on the overfitting caused by purely supervised training with small-sample data.
In some embodiments, any one of the plurality of candidate text data is denoted the target candidate text data; it has a classification label, and the classification prediction result of the pre-trained classification model for it comprises a prediction category and a prediction probability for that category. Selecting target text data from the candidates according to the model's prediction accuracy may then proceed as follows: if the prediction category for the target candidate text data is the same as the category indicated by its classification label, and the prediction probability is greater than a probability threshold, the pre-trained classification model is considered accurate on the target candidate text data, and the target candidate text data is determined to be target text data. The probability threshold may be set by relevant business personnel based on experience, for example 0.9.
Wherein the prediction accuracy may measure the prediction performance of the pre-trained classification model on the candidate text data. The prediction probability may be understood as a confidence level. When the pre-trained classification model predicts the class of a candidate text data with a high prediction probability (i.e., above a confidence threshold) and the predicted class is correct, the model has demonstrated a certain capability to predict the correct result for that candidate text data, and the candidate text data is highly reliable for the classification model; such candidate text data may therefore be selected to train the model, so that the model can learn the features it contains.
In some embodiments, the classification label of the target candidate text data may be the same as that of the corresponding sample text data, or may be obtained by determining a classification label for the target candidate text data in the same manner used for the sample text data. For example, sample text data A is segmented into candidate sample text data A.1 and A.2, and sample text data B is segmented into candidate sample text data B.1 and B.2; then the classification labels of A.1 and A.2 are identical to that of sample text data A, and the classification labels of B.1 and B.2 are identical to that of sample text data B. Alternatively, if sample text data A and B are labeled according to a specified scheme, then candidate sample text data A.1, A.2, B.1, and B.2 are labeled according to the same specified scheme.
S205, training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model.
The trained classification model is used for carrying out classification prediction on the text data.
In some embodiments, the electronic device trains the pre-trained classification model based on the sample text data and the target text data, and the process of obtaining the trained classification model may be the same as the pre-training process. Specifically, denoting any one of the sample text data and the target text data as target sample data: the sample features of the target sample data are obtained; the pre-trained classification model is called to output a classification result for the target sample data based on the sample features; a prediction deviation of the pre-trained classification model is generated based on that classification result and the classification label carried by the target sample data; and the model parameters of the pre-trained classification model are corrected based on the prediction deviation, thereby obtaining the trained classification model. Further, the electronic device may, following the above process, continue to input the remaining candidate data (i.e., the plurality of candidate text data other than the target text data) into the trained classification model for classification prediction, obtain classification prediction results for the remaining candidate data, select new target text data from the remaining candidate data based on those results, and add the new target text data to the sample set to continue training the trained classification model.
In addition, the electronic device can take the sample text data and the target text data as a new sample set, take the candidate text data other than the target text data as a new candidate set, and continue to iteratively train the trained classification model based on the new sample set and the new candidate set according to the above process until the model performance no longer improves significantly, i.e., the model converges, obtaining the final trained classification model. By performing multiple rounds of semi-supervised training based on small-sample data, the samples can be expanded while sample quality is ensured, so that the resulting classification model has better generalization capability and a better model effect, enabling fast and accurate classification of text data.
According to the embodiment of the application, sample text data can be obtained, and N sub-texts are sequentially subjected to sliding segmentation based on a sliding window to obtain a plurality of candidate text data of the sample text data. Because the candidate text data are obtained by sample expansion of the sample text data, the number of samples can be increased and the overfitting that a small-sample training set may cause can be avoided. The classification model is pre-trained based on the sample text data to obtain a pre-trained classification model; the pre-trained classification model is called to classify and predict the plurality of candidate text data, and target text data are selected from them according to the prediction accuracy of the pre-trained classification model, so that higher-quality target text data are selected based on prediction accuracy. The pre-trained classification model is then trained based on the sample text data and the target text data to obtain a trained classification model. The high-quality target text data and sample text data positively affect model training, so the prediction accuracy of the trained classification model can be improved, and text data can in turn be classified and predicted more accurately by the trained classification model.
Referring to fig. 3, fig. 3 is a flowchart of a data processing method according to an embodiment of the present application, where the method may be performed by the above-mentioned electronic device. As shown in fig. 3, the flow of the data processing method in the embodiment of the present application may include the following:
S301, acquiring sample text data.
In some embodiments, the sample text data carries a classification label. The sample text data may be any type of text, and the classification label may be any type of label. For ease of explanation, classifying the risk of conversational text data is described herein as an example. When the sample text data is sample conversation text data, it may be text data transcribed from conversation records between an outbound robot and different users (i.e., multi-turn dialogue text, referring to a continuous, progressive conversation that proceeds according to context toward a specific goal); it may also be text data transcribed from conversation records between relevant personnel and different users; and so on. The composition of the conversation text data is not limited herein.
In some embodiments, the session record includes the user's natural speech (such as colloquial expressions, modal particles, etc.), or the communication session is easily affected by the environment, so the transcribed text data is prone to transcription errors and may contain dirty data (such as invalid characters). Therefore, the sample text data may be preprocessed in advance, and the process indicated by the embodiment of the present application is then performed on the preprocessed sample text data. The preprocessing may be removing dirty data, removing stop words, removing duplicate text, and the like; the specific type of preprocessing is not limited here.
In some embodiments, each sample text data may contain one or more sub-texts. Suppose one sample text data comprises N sub-texts. Specifically, the separator characters in the sample text data may be detected and the sample text data divided into N sub-texts based on the detected separator characters; alternatively, the sub-texts may be obtained by dividing the sample text data according to a defined specified length (if the number of remaining characters is less than the specified length, default characters may be appended so that the size of the last sub-text satisfies the specified length).
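The two splitting strategies above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the separator set, chunk length of 5, and pad character `#` are assumptions chosen to mirror the examples later in the text.

```python
# Sketch of the two sub-text splitting strategies: (a) split on separator
# characters; (b) split into fixed-length chunks, padding the last chunk
# with a default character so it reaches the specified length.

import re

def split_by_separators(text, separators=",.\uff0c\u3002"):
    # \uff0c and \u3002 are the full-width comma and period used in Chinese text
    parts = re.split("[" + re.escape(separators) + "]", text)
    return [p for p in parts if p]  # drop empty fragments

def split_by_length(text, length=5, pad_char="#"):
    chunks = [text[i:i + length] for i in range(0, len(text), length)]
    if chunks and len(chunks[-1]) < length:
        chunks[-1] = chunks[-1].ljust(length, pad_char)  # pad the last sub-text
    return chunks

print(split_by_separators("hello,world.again"))  # ['hello', 'world', 'again']
print(split_by_length("abcdefgh", 5))            # ['abcde', 'fgh##']
```

The second call shows the padding case: eight characters split every five yields a final sub-text padded to the specified length.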
S302, sequentially carrying out sliding segmentation on the N sub-texts based on the sliding window to obtain a plurality of candidate text data of the sample text data.
In some embodiments, the process and principle by which the electronic device performs sliding segmentation on each sample text data based on the sliding window to obtain the corresponding candidate text data are the same; the process of determining candidate text data from one sample text data is described here as an example. The process of determining the candidate text data of the sample text data by the electronic device may be a process of performing sliding segmentation on the N sub-texts contained in the sample text data based on a sliding window. The foregoing segmentation process may be designed following the idea of the N-gram language model. The sliding window segments at sub-text granularity: the window size of the sliding window is M sub-texts, and when the sizes of different runs of M consecutive sub-texts change, the span covered by the corresponding sliding window changes accordingly.
The sliding window may be a fixed window or a variable window; that is, M is a positive integer that may be either a fixed value or a variable value. When the sliding window is a fixed window, the window size c and step size i need to be defined: the window size determines how much data the current window can cover, the step size determines the sliding distance, and both c and i are expressed as numbers of sub-texts. Assuming the current sample text data contains t sub-texts, the number k of candidate text data that can be generated by the sliding window is:

k = ⌊(t − c) / i⌋ + 1

where ⌊·⌋ denotes the floor (round-down) operation, which discards any fractional part; e.g., ⌊3.5⌋ = 3.
For example, the sample text data contains 16 sub-texts, the window size is 3, and the step size is 2, and then the candidate text data generated by sliding the window is 7, and each candidate text data includes 3 complete sub-texts.
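The fixed-window case can be sketched as follows, reproducing the worked example above (16 sub-texts, window size 3, step 2 yields 7 candidates). The function name and sub-text labels are illustrative assumptions.

```python
# Sketch of fixed-window sliding segmentation over sub-texts, following
# k = floor((t - c) / i) + 1 from the text.

def fixed_window_segments(sub_texts, c, i):
    """c: window size (in sub-texts), i: step size (in sub-texts)."""
    t = len(sub_texts)
    k = (t - c) // i + 1  # floor((t - c) / i) + 1
    return [sub_texts[j * i : j * i + c] for j in range(k)]

segs = fixed_window_segments([f"s{n}" for n in range(16)], c=3, i=2)
print(len(segs))     # 7, as in the worked example
print(len(segs[0]))  # each candidate holds 3 complete sub-texts
```

Note that ⌊(16 − 3)/2⌋ + 1 = 6 + 1 = 7, and every window (including the last, starting at index 12) fits entirely within the 16 sub-texts.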
When the sliding window is a variable window, a step size i needs to be defined, and the window size varies according to a specified rule. For example, with the window start point fixed and the window end point growing from the initial size c0 by the step size i, the number k of candidate text data generated through the variable window is:

k = ⌈(t − c0) / i⌉ + 1

where ⌈·⌉ denotes the ceiling (round-up) operation, which rounds any fractional part up to the next integer; e.g., ⌈3.5⌉ = 4.
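The variable-window case can be sketched as follows, assuming (as the text describes) a fixed start point and an end point that grows from the initial size c0 by step i. The function name and sub-text labels are illustrative.

```python
# Sketch of variable-window segmentation: the window start is fixed at the
# first sub-text, the end grows from c0 by step i, giving
# k = ceil((t - c0) / i) + 1 candidates; the last window is capped at t.

import math

def variable_window_segments(sub_texts, c0, i):
    t = len(sub_texts)
    k = math.ceil((t - c0) / i) + 1
    return [sub_texts[: min(c0 + j * i, t)] for j in range(k)]

segs = variable_window_segments(["s1", "s2", "s3", "s4", "s5"], c0=3, i=1)
print(len(segs))               # 3
print([len(s) for s in segs])  # [3, 4, 5]
```

With t = 5, c0 = 3, i = 1 this gives ⌈2/1⌉ + 1 = 3 candidates, matching the variable-window example in the text.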
In some embodiments, segmenting the N sub-texts to obtain candidate text data may further comprise sliding-segmenting the N sub-texts with the sliding window to obtain R sub-text sets, and combining the R sub-text sets into one candidate text data of the sample text data, wherein the data contained in each sub-text set is spliced into a whole and treated as one sub-text. The candidate text data in this case differs in content from the sample text data.
For example, as shown in fig. 4a to 4c, fig. 4a to 4c are schematic views of a scene of acquiring candidate text data according to an embodiment of the present application. As shown in fig. 4a: 1) one or more sample text data are acquired, each sample text data is divided based on specified rule 1, and the sub-texts contained in each sample text data are obtained (the number of sub-texts per sample text data may differ or be the same); 2) the sub-texts contained in each sample text data are taken as an original sample set used for training the model, and a portion of text data may be acquired by the same method as a test sample set; 3) the relevant parameters of the sliding window are defined; 4) the sub-texts contained in each sample text data are traversed and segmented based on the relevant parameters and specified rule 2, obtaining a plurality of candidate text data; 5) the plurality of candidate text data are taken as a candidate sample set, and a total sample set for training the classification model is obtained from the original sample set (optionally together with the test sample set) and the candidate sample set.
Based on the above (2): taking one sample text data (sample 1) as an example, as shown in (1) in fig. 4b, specified rule 1 may divide based on separator characters; that is, splitting sample 1 by its commas and periods yields 5 sub-texts. As shown in (2) of fig. 4b, specified rule 1 may instead filter out the separator characters and divide by a specified length; that is, after filtering the separator characters out of sample 1, the text is divided every 5 characters, yielding 6 sub-texts. In this case, when forming sub-text 6, the number of remaining characters is insufficient, so characters may be padded such that sub-text 6 contains 5 characters.
Based on the above (4): as shown in (1) in fig. 4c, specified rule 2 may be sliding segmentation with a fixed window, taking the sub-texts inside one fixed sliding window as one candidate text data; that is, the 5 sub-texts of sample 1 are divided with a window size of 3 and a step size of 1, obtaining 3 candidate text data. When forming the last candidate text data, if the number of remaining sub-texts is smaller than 3, the remaining sub-texts may be used directly as candidate text data, or default sub-texts may be appended so that the padded remainder contains exactly 3 sub-texts and is then used as candidate text data.
As shown in (2) in fig. 4c, specified rule 2 may be sliding segmentation with a variable window, taking the sub-texts inside one variable window as one candidate text data; that is, the 5 sub-texts of sample 1 are divided with the window end point growing by a step size of 1 from an initial window size of 3, obtaining 3 candidate text data.
As shown in (3) in fig. 4c, the specified rule 2 may be that sliding segmentation is performed according to fixed windows, and the sub-texts in each fixed window are combined into candidate text data, that is, the 5 sub-texts of the sample 1 are divided according to the window size of 3 and the step size of 2, so as to obtain 2 sub-text sets, the data in one sub-text set is spliced into one sub-text, and the 2 sub-text sets are combined to obtain candidate text data.
In some embodiments, after the plurality of candidate text data are acquired, it may also be determined whether duplicate data exist among them, and a deduplication operation may be performed on the plurality of candidate text data. In this case, the classification labels of the plurality of candidate text data may be determined in the manner used for labeling identical text data.
S303, pre-training the classification model based on the sample text data to obtain a pre-trained classification model. For the specific implementation of step S303, refer to the related description of the foregoing embodiments, which is not repeated here.
S304, invoking a pre-trained classification model to classify and predict the plurality of candidate text data, and selecting target text data from the plurality of candidate text data according to the prediction accuracy of the pre-trained classification model for the plurality of candidate text data.
In some embodiments, the electronic device may invoke a pre-trained classification model to classify and predict a plurality of candidate text data, obtain a classification prediction result for each candidate text data, and determine target text data according to a prediction accuracy characterized by the classification prediction result.
In some embodiments, any one of the plurality of candidate text data is denoted as target candidate text data; the target candidate text data has a classification label, and the manner of determining that label can be found in the related description of the above embodiments. If the classification prediction result of the pre-trained classification model for the target candidate text data comprises a prediction category and a prediction probability for that category, then selecting the target text data from the plurality of candidate text data according to the prediction accuracy of the pre-trained classification model may specifically be: if the prediction category is the same as the category indicated by the classification label and the prediction probability is greater than the probability threshold, determining that the model has prediction accuracy for the target candidate text data, and taking the target candidate text data as target text data.
In some embodiments, if the classification prediction result of the pre-trained classification model for the target candidate text data comprises only a prediction category, then selecting the target text data from the plurality of candidate text data according to the prediction accuracy may specifically be: if the prediction category of the target candidate text data is the same as the category indicated by its classification label, determining that the pre-trained classification model has prediction accuracy for the target candidate text data and determining the target candidate text data to be target text data.
S305, acquiring a first sample feature corresponding to the sample text data and a second sample feature corresponding to the target text data, calling a pre-trained classification model to output a first classification result for the sample text data based on the first sample feature and a second classification result for the target text data based on the second sample feature.
In some embodiments, the first sample feature corresponding to the sample text data and the second sample feature corresponding to the target text data are obtained in the same manner and on the same principle; any one of the sample text data and the target text data (denoted as target sample data) is taken as an example. The electronic device can perform feature engineering on the target sample data to obtain the corresponding sample features, and call the pre-trained classification model to predict on the sample features so as to output the classification result for the target sample data. The feature engineering may be set by the relevant business personnel according to the specific application scenario; the resulting sample features may include sample text statistical features of the target sample data, and/or sample text features of the target sample data generated by calling the pre-trained classification model. The sample text statistical features may comprise at least one of: the part-of-speech frequency distribution feature of the characters in the target sample data, the statistical features of the sub-texts in the target sample data, or the statistical features of the classification keywords in the target sample data. The feature dimensions of the individual features within the sample features are fixed; that is, the same feature has the same dimension across different sample features, while different features may or may not share the same dimension.
In some embodiments, the electronic device may call a pre-trained classification model to predict the sample features, sequentially splice multiple features included in the sample features to obtain spliced sample features, and call the pre-trained classification model to predict the spliced sample features.
In some embodiments, obtaining the part-of-speech frequency distribution feature for the characters in the target sample data may comprise performing jieba word segmentation and part-of-speech analysis on the target sample data to obtain the part of speech of each word, and generating the part-of-speech frequency distribution feature according to the parts of speech. For example, if the word segmentation set L for the target sample data contains s words, then for words of part-of-speech type tc, the part-of-speech frequency value f_tc is:

f_tc = |{x ∈ L : t(x) = tc}| / s

where t(x) is a function returning the part of speech of the input word segment x.
Therefore, a plurality of part-of-speech frequency values can be calculated according to the above formula and spliced to obtain the part-of-speech frequency distribution feature; alternatively, each part-of-speech frequency value can be mapped into a specified range and the mapped values spliced. For example, if the verb frequency is 0.3 and the noun frequency is 0.1, the part-of-speech frequency distribution feature is [0.3, 0.1, ...]; or, mapping part-of-speech frequencies into the range 0 to 10, the mapped verb frequency is 3 and the mapped noun frequency is 1, giving the feature [3, 1, ...]. The composition of the part-of-speech frequency distribution feature can be seen, for example, in table 1 below:
TABLE 1
Various parts of speech may be chosen according to the specific application scenario; for example, parts of speech whose frequency value is 0 or less than a preset threshold may be omitted. In addition, when the conversation text data is obtained from conversation records between the outbound robot and the user, the outbound robot side usually follows a specified script template while most of the information is contained in the user's utterances; therefore, when calculating the part-of-speech frequency distribution feature, it is also possible to take only the target sub-texts belonging to the user side from the N sub-texts and determine the feature based on those target sub-texts. Because conversation text data transcribed from session records may suffer from context conflicts (e.g., a user initially denying an action but subsequently acknowledging it) and transcription-quality issues (e.g., highly colloquial text), adding the part-of-speech frequency distribution feature effectively describes the characteristics of spoken text, counteracts the text problems caused by context conflicts and speech transcription, and compensates for the distortion and loss of semantic information from the perspective of the user's spoken language, thereby substantially improving the model effect.
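The part-of-speech frequency feature can be sketched as below. Rather than invoking a real tagger (the text mentions jieba), this hedged sketch assumes the word segments have already been tagged as (word, pos) pairs; f_tc is simply the fraction of the s segments whose tag equals tc, spliced in a fixed part-of-speech order.

```python
# Hypothetical sketch of the part-of-speech frequency distribution feature.
# tagged_words: list of (word, pos) pairs; pos_order fixes the feature layout.

def pos_frequency_feature(tagged_words, pos_order):
    s = len(tagged_words)
    counts = {}
    for _, pos in tagged_words:
        counts[pos] = counts.get(pos, 0) + 1
    # one frequency value per part of speech, in a fixed order
    return [counts.get(pos, 0) / s for pos in pos_order]

tagged = [("run", "v"), ("jump", "v"), ("dog", "n"), ("the", "x"),
          ("cat", "n"), ("sleep", "v"), ("a", "x"), ("fast", "d"),
          ("eat", "v"), ("bird", "n")]
# frequencies for verbs, nouns, adverbs
print(pos_frequency_feature(tagged, ["v", "n", "d"]))  # [0.4, 0.3, 0.1]
```

Fixing `pos_order` keeps the feature dimension constant across samples, consistent with the text's requirement that feature dimensions be fixed.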
In some embodiments, the statistical features obtained for the sub-texts in the target sample data specifically include, but are not limited to, the following types, each determined based on the sub-texts: the number of dialogue turns with the user, the average length of the sub-texts, the standard deviation of sub-text lengths, and so on. It will be appreciated that the statistical features may also differ depending on the type of sample text data.
Therefore, various types of statistical results can be calculated, and the statistical characteristics aiming at the sub-text in the target sample data can be obtained by splicing the statistical results. The composition of this statistical feature can be seen, for example, in table 2 below:
TABLE 2
Since much of the information is contained in the user's utterances, when calculating the statistical features for the sub-texts in the target sample data, some of the statistical features may also be obtained only from the target sub-texts belonging to the user side and determined based on those target sub-texts. The specific rules may be set by the relevant business personnel based on empirical values.
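The three named sub-text statistics can be sketched as follows. This is an illustrative sketch; the `speakers` list for detecting dialogue turns and all names are assumptions, not from the patent.

```python
# Sketch of the sub-text statistical features: dialogue-turn count,
# average sub-text length, and standard deviation of sub-text lengths.

import statistics

def subtext_statistics(sub_texts, speakers):
    """speakers[i] identifies who produced sub_texts[i] (e.g. 'bot'/'user')."""
    lengths = [len(s) for s in sub_texts]
    # count a new "turn" each time the speaker changes
    turns = 1 + sum(1 for a, b in zip(speakers, speakers[1:]) if a != b)
    return {
        "turns": turns,
        "avg_len": statistics.mean(lengths),
        "std_len": statistics.pstdev(lengths),  # population std deviation
    }

feats = subtext_statistics(["hello", "hi", "how are you", "fine"],
                           ["bot", "user", "bot", "user"])
print(feats["turns"])    # 4
print(feats["avg_len"])  # 5.5
```

Splicing these scalar values in a fixed order would give the statistical feature vector described in the text.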
In some embodiments, obtaining the statistical features of the classification keywords in the target sample data may specifically be: obtaining the specified classification keywords (for example, risk-related keywords such as "transfer", a risky application name, etc.), detecting whether the target sample data contains each classification keyword, generating a one-hot code for each classification keyword according to the detection result, and splicing the one-hot codes of all classification keywords to obtain the statistical features of the classification keywords. The classification keywords may be set by the relevant business personnel according to the actual scenario, and the one-hot encoding rule may likewise be set by the relevant business personnel. Illustratively, if the target sample data contains a classification keyword, its one-hot code is a first value; if not, a second value. For example, for the "transfer" keyword, if it is contained, the corresponding code is 1; for the "risky application name" keyword, if it is not contained, the corresponding code is 0; the spliced statistical feature is then [1, 0]. The composition of the statistical features of the classification keywords may be, for example, as shown in table 3 below:
TABLE 3 Table 3
Since much of the information is contained in the user's utterances, when calculating the statistical features of the classification keywords, it is also possible to obtain the target sub-texts belonging to the user side and detect the classification keywords only within those target sub-texts. The specific rules may be set by the relevant business personnel based on empirical values.
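The keyword one-hot feature can be sketched in a few lines. The keyword strings below are illustrative stand-ins for the risk-related terms mentioned in the text, not actual values from the patent.

```python
# Sketch of the classification-keyword statistical feature: for each
# keyword in a fixed list, emit 1 if the text contains it, else 0.

def keyword_onehot(text, keywords):
    return [1 if kw in text else 0 for kw in keywords]

keywords = ["transfer", "refund", "app name"]  # illustrative risk keywords
print(keyword_onehot("please transfer the money now", keywords))  # [1, 0, 0]
```

Because the keyword list is fixed, the resulting feature has a constant dimension, which matches the fixed-dimension requirement stated earlier.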
In some embodiments, the sample text feature may specifically be obtained by calling the pre-trained classification model to generate a text vector of the target sample data and using the text vector as the sample text feature. The text vector may be generated by producing a sentence vector for each sub-text and determining the text vector of the target sample data from those sentence vectors; for example, the average of the sentence vectors of the sub-texts may be used as the text vector, which is not limited here. Alternatively, a word vector may be generated for each word segment in a sub-text, the sentence vector of that sub-text determined from the word vectors (for example, as the average of the word vectors of its word segments, again without limitation), and the text vector of the target sample data then determined from the sentence vectors of the sub-texts in the same manner as described above.
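The averaging scheme just described can be sketched as follows. This is a hedged toy example: the 3-dimensional word vectors are invented for illustration, whereas a real system would obtain them from a trained word2vec model as the text describes.

```python
# Sketch of building a text vector by averaging sentence vectors, where
# each sentence vector is the average of its words' vectors.

def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def text_vector(sub_texts, word_vectors):
    sentence_vecs = []
    for sub in sub_texts:
        vecs = [word_vectors[w] for w in sub.split() if w in word_vectors]
        if vecs:
            sentence_vecs.append(mean_vector(vecs))  # sentence = mean of words
    return mean_vector(sentence_vecs)                # text = mean of sentences

word_vectors = {"good": [1.0, 0.0, 1.0], "day": [0.0, 1.0, 1.0],
                "bye": [2.0, 2.0, 0.0]}
print(text_vector(["good day", "bye"], word_vectors))  # [1.25, 1.25, 0.5]
```

Averaging at both levels keeps the text vector in the same dimension as the word vectors, so the feature dimension stays fixed regardless of how many sub-texts a sample contains.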
In some embodiments, when the text vector is generated by calling the pre-trained classification model to produce a sentence vector for each sub-text, a feature generation layer may be built in the classification model, the feature generation layer comprising a sentence vector generation network, which may be built following the idea of word2vec (a word vector tool). When the text vector is generated by calling the pre-trained classification model to produce a word vector for each word segment, the feature generation layer may comprise a word vector generation network, which may be built following the idea of CBoW (Continuous Bag-of-Words), Skip-Gram, or GloVe (Global Vectors for Word Representation, a word representation tool based on global word-frequency statistics).
In some embodiments, in order for the sample text feature of the target sample data to better cover the semantic information, the technical scheme of the present application adopts 200-dimensional word2vec word vector features. This specifically comprises: obtaining the word segmentation set corresponding to the target sample data, training the 200-dimensional word2vec word vectors in the classification model using gensim (a natural language processing library), and performing vector transformation on the word segmentation set based on the trained word2vec model to obtain the word vectors corresponding to the target sample data. Other ways of determining the sample text feature are possible, and this is not limiting.
Because the conversation on the outbound robot side in the sample text data usually follows a fixed template, default vectors corresponding to the outbound-robot-side sub-texts can be preset; when the classification model is called to generate the sample text feature, vector conversion can be performed only on the target sub-texts of the user side, and the corresponding sample text feature can be obtained from the conversion vectors of the target sub-texts together with the default vectors.
In some embodiments, the classification model described above may be constructed based on any model structure and concept. The classification model may include a feature prediction layer, which may be understood as a classifier for generating the classification result. When the classification result indicates one of two categories, the classifier is a binary classifier. The electronic device can select a suitable modeling approach for the feature prediction layer in the classification model according to the specific sample features. For example, if the statistical features and part-of-speech frequency distribution features of the sub-texts are used, conventional machine learning models such as logistic regression classifiers, random forest classifiers, or decision tree classifiers may be used; if the sample text feature is used, a deep learning model suited to text features, such as an RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory network), or GRU (Gated Recurrent Unit), can be used. The technical scheme of the present application selects a bidirectional LSTM to construct the model structure of the feature prediction layer in the classification model. The classification model may also adopt other model structures, which are not limited here.
S306, generating a prediction deviation for the pre-trained classification model based on the first classification result and the second classification result, and correcting model parameters of the pre-trained classification model based on the prediction deviation to obtain a trained classification model.
In some embodiments, the electronic device may construct a loss function, generate a prediction bias for the pre-trained classification model based on the classification result for the target sample data and the classification label carried by the target sample data, correct model parameters of the pre-trained classification model based on the prediction bias, and obtain a trained classification model through continuously iterative semi-supervised training. It will be appreciated that the pre-training process of the classification model is the same as the training process of the pre-trained classification model described above, and the specific types of sample features used are also the same.
For example, as shown in fig. 5, fig. 5 is a schematic flow chart of training a classification model according to an embodiment of the present application; wherein: 1) Pre-train the classification model with the original sample set (namely one or more sample text data) to obtain a pre-trained classification model. 2) Invoke the pre-trained classification model to perform classification prediction on the candidate sample set (namely a plurality of candidate text data) to obtain a classification prediction result (comprising a prediction category and a prediction probability) for each candidate text data. 3) Divide the candidate sample set into a high-reliability candidate sample set and a low-reliability candidate sample set according to the prediction accuracy characterized by the classification prediction result of each candidate text data; specifically, candidate text data whose prediction probability for the prediction category is greater than a probability threshold form the high-reliability candidate sample set, and candidate text data whose prediction probability is less than or equal to the probability threshold form the low-reliability candidate sample set. 4) Select target text data from the high-reliability candidate sample set and add it to the original sample set to obtain a new sample set; specifically, candidate text data whose prediction category is the same as the category indicated by the classification label are taken as target text data; or the high-reliability candidate sample set is randomly sampled, and the candidate text data in the sampled data whose prediction category is the same as the category indicated by the classification label are taken as target text data. The candidate text data in the candidate sample set other than the target text data form a new candidate sample set. 5) Continue executing from step 1) with the new sample set and the new candidate sample set. When the performance of the classification model no longer improves noticeably after multiple rounds of training, that is, when successive prediction results of the classification model on a test sample set are substantially unchanged, a trained classification model and an expanded sample set (namely the original sample set used for training the model at that point) are obtained. A new classification model may then be trained in a supervised manner on the expanded sample set.
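The iteration of steps 1) to 5) above can be sketched as a generic self-training loop. This is an illustrative outline only: the model interface (`fit`/`predict`), the assumption that candidates inherit the label of the sample they were sliced from, and the "no sample promoted" stopping rule are simplifications, not the embodiment's exact formulation:

```python
def self_train(model, labeled, candidates, prob_threshold=0.9, max_rounds=10):
    """Iterative semi-supervised training loop (sketch of steps 1-5).

    `model` is assumed to expose fit(samples) and predict(text) -> (category,
    probability); `labeled` and `candidates` are lists of (text, label) pairs.
    """
    for _ in range(max_rounds):
        model.fit(labeled)                        # 1) (pre-)train on sample set
        promoted, remaining = [], []
        for text, label in candidates:
            category, prob = model.predict(text)  # 2) classify candidates
            # 3)+4) high-reliability: confident AND consistent with the label
            if prob > prob_threshold and category == label:
                promoted.append((text, label))
            else:
                remaining.append((text, label))
        if not promoted:                          # 5) stop when nothing moves
            break
        labeled = labeled + promoted              # expanded sample set
        candidates = remaining
    return model, labeled
```

In the embodiment, the stopping condition is instead that predictions on a test sample set stop changing across rounds; the structure of the loop is otherwise the same.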
In some embodiments, taking a trained classification model used for classifying the risk of text data (here, session text data) as an example, the application of the classification model may be as follows: obtain session text data of a target object, obtain session text statistical features of the session text data, call the trained classification model to generate session text features of the session text data, call the trained classification model to output a risk classification result for the session text data based on the session text statistical features and the session text features, and output the risk classification result, where the risk classification result may indicate that the session text data has a session risk or does not have a session risk. The session text data can be obtained by an outbound robot making intelligent outbound calls to the target object. If the risk classification result indicates that the session text data has a session risk, an early warning is issued for the target object, such as manual intervention.
In some embodiments, there may be multiple trained classification models, and the risk classification results output by the multiple trained classification models jointly determine whether to perform early warning. The multiple trained classification models may each be obtained by the training process described above; alternatively, one trained classification model and a final expanded sample set may be obtained by the process above, and the expanded sample set may then be used to train the other classification models.
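One simple way for multiple models to jointly determine whether to perform early warning is a minimum-hits rule; the result format and the threshold below are assumptions for illustration, since the embodiment does not fix a specific combination policy:

```python
def needs_early_warning(risk_results, min_hits=1):
    """Combine risk classification results from multiple trained models.

    Each result is assumed to be a (risk description, flagged?) pair;
    early warning (e.g. manual intervention) is triggered once at least
    `min_hits` models flag a session risk.
    """
    hits = [desc for desc, risky in risk_results if risky]
    return len(hits) >= min_hits, hits


# Usage with illustrative model outputs:
alert, reasons = needs_early_warning(
    [("contacted malicious information", True),
     ("transferred money", False),
     ("downloaded risk application", False)])
```

A stricter deployment might raise `min_hits` or weight models differently; that choice is a business decision outside the scope of the described flow.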
For example, as shown in fig. 6a, fig. 6a is a schematic flow chart of applying a classification model according to an embodiment of the present application; wherein: 1) Acquire session text data, and process it to obtain the N sub-session text data contained in the session text data. 2) Perform feature engineering on the N sub-session text data to obtain the session features respectively input to a plurality of trained classification models (for example, a "whether malicious information was contacted" classification model, a "whether money was transferred" classification model, and so on); specifically, obtain session text statistical features from the N sub-session text data, call each trained classification model to generate session text features for that model, and take the session text statistical features together with the session text features of each trained classification model as the session features respectively input to that model. 3) Invoke the plurality of trained classification models to predict based on their respective input session features, obtaining a plurality of risk classification results (e.g., "contacted malicious information", "transferred money", "no risk application downloaded", etc.). 4) Determine, according to the plurality of risk classification results, whether the session text data has a session risk and whether early warning (such as manual intervention) is performed.
The flow of the feature engineering on the N sub-session text data may be as shown in fig. 6b; wherein: 1) Determine statistical features of the sub-session text data in the session text data, such as the average length of the user-side sub-session text data among the N sub-session text data. 2) Determine part-of-speech frequency distribution features for the characters in the session text data. 3) Determine statistical features of the classification keywords in the session text data. 4) Call the trained classification model to generate the sample text features of the session text data.
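The statistical side of this feature engineering (steps 1 to 3 of fig. 6b) might be sketched as follows. The speaker tags, the keyword list, and the use of a character-class ratio as a stand-in for the part-of-speech frequency distribution (which would normally require a POS tagger) are all illustrative assumptions:

```python
def session_statistics(sub_sessions, keywords=("transfer", "account", "password")):
    """Compute simple statistical features of a session (illustrative sketch).

    `sub_sessions` is assumed to be a list of (speaker, text) pairs.
    """
    # 1) Statistical features of sub-sessions, e.g. average user-side length.
    user_turns = [text for speaker, text in sub_sessions if speaker == "user"]
    avg_user_len = (sum(len(t) for t in user_turns) / len(user_turns)
                    if user_turns else 0.0)
    all_text = " ".join(text for _, text in sub_sessions)
    # 2) A crude character-class frequency (placeholder for POS frequencies).
    n_chars = max(len(all_text), 1)
    digit_ratio = sum(c.isdigit() for c in all_text) / n_chars
    # 3) Statistical features of classification keywords.
    keyword_hits = {k: all_text.lower().count(k) for k in keywords}
    return {"avg_user_len": avg_user_len,
            "digit_ratio": digit_ratio,
            "keyword_hits": keyword_hits}
```

Step 4, the sample text features, would come from the trained classification model itself (e.g. the bidirectional LSTM's learned representation) and is not reproduced here.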
As another example, as shown in fig. 7, fig. 7 is a schematic diagram of a risk early warning scene based on a classification model according to an embodiment of the present application, taking risk classification of session text data as an example:
Acquiring early warning data, where the early warning data may include object information of a target object, a risk source, and the like; the early warning data may be generated and sent by upstream service equipment, or may be detected and generated by the electronic device; for example, a security protection program is installed in the terminal of the target object, and the electronic device may be the background equipment of the security protection program, with the permission to detect abnormal operation behaviors on the terminal, such as answering a malicious incoming call; or a security plug-in is embedded in a target application (such as a browser) installed in the terminal of the target object, and the electronic device may be the background equipment of the security plug-in, with the permission to detect abnormal operation behaviors in the target application, such as browsing malicious websites;
the intelligent voice module in the electronic device, or outbound-call equipment, can initiate an intelligent outbound call to the terminal of the target object based on the early warning data to obtain a session record, process the session record (for example, transcribe it), and splice the transcribed session content to obtain session text data; the templates used for intelligent outbound calls to different target objects may differ, and can be set by the relevant business personnel;
Risk prediction is performed based on the session text data to obtain a processing result, for example: a plurality of trained classification models are invoked to perform classification prediction on the session text data, a plurality of classification prediction results for risk prediction (such as "transferred money", "contacted malicious information", and so on) are obtained, and whether risk early warning (such as manual intervention) is needed is determined based on the plurality of classification prediction results;
Thus, the overall processing flow for the trained classification model may include two parts: the training process, which includes three modules (sample preprocessing, feature engineering, and iterative semi-supervised training); and the application process, which includes one module (text classification). See table 4 below for details:
Table 4
According to the embodiments of the present application, sample text data can be obtained, and the N sub-texts can be sequentially segmented based on a sliding window to obtain a plurality of candidate text data of the sample text data; since the candidate text data are obtained by sample expansion of the sample text data, the number of samples can be increased, and the over-fitting problem that a small sample training set may cause can be avoided; the classification model is pre-trained based on the sample text data to obtain a pre-trained classification model; the pre-trained classification model is invoked to perform classification prediction on the plurality of candidate text data, and target text data are selected from the plurality of candidate text data according to the prediction accuracy of the pre-trained classification model for the plurality of candidate text data; based on the prediction accuracy, target text data of higher quality can thus be selected from the candidate text data; a first sample feature corresponding to the sample text data and a second sample feature corresponding to the target text data are acquired, the pre-trained classification model is invoked to output a first classification result for the sample text data based on the first sample feature and a second classification result for the target text data based on the second sample feature, a prediction deviation of the pre-trained classification model is generated based on the first classification result and the second classification result, and model parameters of the pre-trained classification model are corrected based on the prediction deviation to obtain a trained classification model; the high-quality target text data and sample text data can positively affect model training, so that the prediction accuracy of the trained classification model can be improved, and the text data can then be classified and predicted more accurately by the trained classification model.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus according to the present application. It should be noted that the data processing apparatus shown in fig. 8 is configured to execute the methods of the embodiments shown in fig. 2 and fig. 3; for convenience of explanation, only the portions relevant to the embodiments of the present application are shown, and for specific technical details not disclosed, reference is made to the embodiments shown in fig. 2 and fig. 3 of the present application. The data processing apparatus 800 may include an acquisition module 801 and a processing module 802. Wherein:
An obtaining module 801, configured to obtain sample text data; the number of the sample text data is smaller than a sample number indicator; the sample text data carries a classification label; the sample text data comprises N sub-texts, where N is a positive integer;
A processing module 802, configured to sequentially perform sliding segmentation on the N sub-texts based on a sliding window to obtain a plurality of candidate text data of the sample text data; any one candidate text data comprises one or more consecutive sub-texts among the N sub-texts;
the processing module 802 is further configured to train the classification model based on the sample text data to obtain a pre-trained classification model;
The processing module 802 is further configured to invoke a pre-trained classification model to perform classification prediction on the plurality of candidate text data, and select target text data from the plurality of candidate text data according to prediction accuracy of the pre-trained classification model for the plurality of candidate text data;
The processing module 802 is further configured to train the pre-trained classification model based on the sample text data and the target text data, to obtain a trained classification model; the trained classification model is used for carrying out classification prediction on the text data.
In some embodiments, the acquisition module 801 is further configured to:
detecting separation characters in sample text data;
the sample text data is divided into N sub-texts based on the detected separator characters.
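The splitting into sub-texts and the sliding-window expansion handled by the modules above can be sketched as follows; the separator set is an assumption for illustration, since the embodiment only requires that some separation characters be detected:

```python
import re

def split_sub_texts(sample_text, separators=r"[。！？.!?]"):
    """Divide sample text data into N sub-texts at separation characters."""
    parts = re.split(separators, sample_text)
    return [p.strip() for p in parts if p.strip()]

def sliding_candidates(sub_texts):
    """Slide windows of every size over the N sub-texts, so each candidate
    text is a run of one or more consecutive sub-texts."""
    n = len(sub_texts)
    return [" ".join(sub_texts[i:i + w])
            for w in range(1, n + 1)       # window size
            for i in range(n - w + 1)]     # window start position
```

For N sub-texts this yields N + (N-1) + … + 1 = N(N+1)/2 candidates, which is the sample-expansion effect the embodiment relies on to avoid over-fitting on a small training set.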
In some embodiments, any one of the plurality of candidate text data is represented as target candidate text data, the target candidate text data having a classification tag, the classification prediction result of the pre-trained classification model for the target candidate text data comprising a prediction category for the target candidate text data and a prediction probability for the prediction category;
The processing module 802, when configured to select target text data from the plurality of candidate text data according to the prediction accuracy of the pre-trained classification model for the plurality of candidate text data, is specifically configured to:
if the prediction category of the target candidate text data is the same as the category indicated by the classification label of the target candidate text data and the prediction probability of the target candidate text data is greater than the probability threshold, determining that the pre-trained classification model has prediction accuracy on the target candidate text data and determining that the target candidate text data is the target text data.
In some embodiments, any one of the plurality of candidate text data is represented as target candidate text data, the target candidate text data having a classification tag, the classification prediction result of the pre-trained classification model for the target candidate text data comprising a prediction category for the target candidate text data;
The processing module 802, when configured to select target text data from the plurality of candidate text data according to the prediction accuracy of the pre-trained classification model for the plurality of candidate text data, is specifically configured to:
If the prediction category of the target candidate text data is the same as the category indicated by the classification label of the target candidate text data, determining that the pre-trained classification model has prediction accuracy on the target candidate text data, and determining that the target candidate text data is the target text data.
In some embodiments, any one of the sample text data and the target text data is represented as target sample data; the processing module 802 is specifically configured to, when configured to train the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model:
acquiring sample text statistical characteristics aiming at target sample data;
Invoking a pre-trained classification model to generate sample text features of target sample data;
Invoking a pre-trained classification model to output a classification result aiming at target sample data based on the sample text statistical characteristics and the sample text characteristics;
Generating a prediction bias for the pre-trained classification model based on the classification result for the target sample data and the classification label carried by the target sample data;
model parameters of the pre-trained classification model are corrected based on the prediction bias, and a trained classification model is obtained.
In some embodiments, the sample text statistics include at least one of: the feature of part-of-speech frequency distribution of characters in the target sample data, the statistical feature of sub-text in the target sample data, or the statistical feature of classification keywords in the target sample data.
In some embodiments, the trained classification model is used to classify the risk of the text data;
the processing module 802 is further configured to:
Acquiring session text data of a target object, and acquiring session text statistical characteristics of the session text data;
Calling the trained classification model to generate session text characteristics of the session text data;
Invoking the trained classification model to output a risk classification result aiming at the session text data based on the session text statistical characteristics and the session text characteristics;
the processing module 802 is further configured to:
And if the risk classification result is used for indicating that the session text data has session risk, early warning is carried out on the target object.
In the embodiment of the application, the acquisition module acquires sample text data; the processing module sequentially performs sliding segmentation on the N sub-texts based on a sliding window to obtain a plurality of candidate text data of the sample text data; the processing module pre-trains the classification model based on the sample text data to obtain a pre-trained classification model; the processing module invokes the pre-trained classification model to perform classification prediction on the plurality of candidate text data and selects target text data from the plurality of candidate text data according to the prediction accuracy of the pre-trained classification model for the plurality of candidate text data; and the processing module trains the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model used for performing classification prediction on text data. Through this apparatus, the candidate text data are obtained by sample expansion of the sample text data, so the number of samples can be increased and the over-fitting problem that a small sample training set may cause is avoided; target text data of higher quality can be selected from the candidate text data based on the prediction accuracy; the high-quality target text data and sample text data can positively affect model training, so the prediction accuracy of the trained classification model can be improved, and the text data can then be classified and predicted more accurately by the trained classification model.
The functional modules in the embodiments of the present application may be integrated into one module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules, which is not limited by the present application.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 9, the electronic device 900 includes at least one processor 901 and a memory 902. Optionally, the electronic device may further comprise a network interface. Data can be exchanged among the processor 901, the memory 902 and the network interface; the network interface, controlled by the processor 901, is used for receiving and transmitting messages; the memory 902 is used for storing a computer program comprising program instructions; and the processor 901 is configured to invoke the program instructions stored in the memory 902 to perform the above method.
The memory 902 may include a volatile memory, such as a random-access memory (RAM); the memory 902 may also include a non-volatile memory, such as a flash memory or a solid-state drive (SSD); the memory 902 may also include a combination of the above types of memory.
The processor 901 may be a central processing unit (CPU). In one embodiment, the processor 901 may also be a graphics processing unit (GPU). The processor 901 may also be a combination of a CPU and a GPU.
In one possible implementation, the memory 902 is configured to store program instructions that the processor 901 may call to perform the steps of:
acquiring sample text data; the number of the sample text data is smaller than a sample number indicator; the sample text data carries a classification label; the sample text data comprises N sub-texts, where N is a positive integer;
Sequentially performing sliding segmentation on the N sub-texts based on a sliding window to obtain a plurality of candidate text data of the sample text data; any one candidate text data comprises one or more consecutive sub-texts among the N sub-texts;
Pre-training the classification model based on the sample text data to obtain a pre-trained classification model;
Invoking a pre-trained classification model to classify and predict a plurality of candidate text data, and selecting target text data from the plurality of candidate text data according to the prediction accuracy of the pre-trained classification model for the plurality of candidate text data;
Training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model; the trained classification model is used for classifying and predicting the text data.
In some embodiments, the processor 901 is further for:
detecting separation characters in sample text data;
the sample text data is divided into N sub-texts based on the detected separator characters.
In some embodiments, any one of the plurality of candidate text data is represented as target candidate text data, the target candidate text data having a classification tag, the classification prediction result of the pre-trained classification model for the target candidate text data comprising a prediction category for the target candidate text data and a prediction probability for the prediction category;
The processor 901, when configured to select target text data from a plurality of candidate text data according to a pre-trained classification model for prediction accuracy of the plurality of candidate text data, is specifically configured to:
if the prediction category of the target candidate text data is the same as the category indicated by the classification label of the target candidate text data and the prediction probability of the target candidate text data is greater than the probability threshold, determining that the pre-trained classification model has prediction accuracy on the target candidate text data and determining that the target candidate text data is the target text data.
In some embodiments, any one of the plurality of candidate text data is represented as target candidate text data, the target candidate text data having a classification tag, the classification prediction result of the pre-trained classification model for the target candidate text data comprising a prediction category for the target candidate text data;
The processor 901, when configured to select target text data from a plurality of candidate text data according to a pre-trained classification model for prediction accuracy of the plurality of candidate text data, is specifically configured to:
If the prediction category of the target candidate text data is the same as the category indicated by the classification label of the target candidate text data, determining that the pre-trained classification model has prediction accuracy on the target candidate text data, and determining that the target candidate text data is the target text data.
In some embodiments, any one of the sample text data and the target text data is represented as target sample data; the processor 901 is configured to, when configured to train the pre-trained classification model based on the sample text data and the target text data, obtain a trained classification model, specifically:
acquiring sample text statistical characteristics aiming at target sample data;
Invoking a pre-trained classification model to generate sample text features of target sample data;
Invoking a pre-trained classification model to output a classification result aiming at target sample data based on the sample text statistical characteristics and the sample text characteristics;
Generating a prediction bias for the pre-trained classification model based on the classification result for the target sample data and the classification label carried by the target sample data;
model parameters of the pre-trained classification model are corrected based on the prediction bias, and a trained classification model is obtained.
In some embodiments, the sample text statistics include at least one of: the feature of part-of-speech frequency distribution of characters in the target sample data, the statistical feature of sub-text in the target sample data, or the statistical feature of classification keywords in the target sample data.
In some embodiments, the trained classification model is used to classify the risk of the text data;
the processor 901 is also for:
Acquiring session text data of a target object, and acquiring session text statistical characteristics of the session text data;
Calling the trained classification model to generate session text characteristics of the session text data;
Invoking the trained classification model to output a risk classification result aiming at the session text data based on the session text statistical characteristics and the session text characteristics;
the processor 901 is also for:
And if the risk classification result is used for indicating that the session text data has session risk, early warning is carried out on the target object.
In specific implementation, the devices, processors, memories and the like described above may perform the implementations described in the foregoing method embodiments, or the implementations described in the embodiments of the present application, which are not described herein again.
In an embodiment of the present application, there is further provided a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to perform some or all of the steps performed in the method embodiments described above. The computer storage medium may be volatile or non-volatile. The computer-readable storage medium may mainly include a storage program area and a storage data area: the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
Embodiments of the present application also provide a computer program product comprising computer instructions which, when executed by a processor, implement some or all of the steps of the above-described method.
References herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate that A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer storage medium, which may be a computer-readable storage medium, and the program, when executed, may include the flows of the embodiments of each of the above methods. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like.
The above disclosure presents only some examples of the present application and is not intended to limit the scope of the present application; those skilled in the art will understand that all or part of the above embodiments may be implemented, and equivalent changes made according to the claims of the present application still fall within the scope covered by the application.

Claims (11)

1. A method of data processing, the method comprising:
Acquiring sample text data; the number of the sample text data is smaller than a sample number indicator; the sample text data carries a classification label; the sample text data comprises N sub-texts, where N is a positive integer;
sequentially performing sliding segmentation on the N sub-texts based on a sliding window to obtain a plurality of candidate text data of the sample text data; any one candidate text data comprises one or more consecutive sub-texts among the N sub-texts;
pre-training the classification model based on the sample text data to obtain a pre-trained classification model;
Invoking the pre-trained classification model to classify and predict the plurality of candidate text data, and selecting target text data from the plurality of candidate text data according to the prediction accuracy of the pre-trained classification model for the plurality of candidate text data;
training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model; the trained classification model is used for carrying out classification prediction on the text data.
2. The method according to claim 1, wherein the method further comprises:
Detecting separation characters in the sample text data;
The sample text data is divided into the N sub-texts based on the detected separator characters.
3. The method of claim 1, wherein any one of the plurality of candidate text data is represented as target candidate text data, the target candidate text data having a classification tag, the classification prediction result of the pre-trained classification model for the target candidate text data comprising a prediction category for the target candidate text data and a prediction probability for the prediction category;
The selecting target text data from the plurality of candidate text data according to the predictive accuracy of the pre-trained classification model for the plurality of candidate text data includes:
If the prediction category of the target candidate text data is the same as the category indicated by the classification label of the target candidate text data and the prediction probability of the target candidate text data is greater than a probability threshold, determining that the pre-trained classification model has prediction accuracy for the target candidate text data and determining that the target candidate text data is the target text data.
4. The method of claim 1, wherein any of the plurality of candidate text data is represented as target candidate text data, the target candidate text data having a classification tag, the classification prediction result of the target candidate text data by the pre-trained classification model comprising a prediction category for the target candidate text data;
The selecting target text data from the plurality of candidate text data according to the predictive accuracy of the pre-trained classification model for the plurality of candidate text data includes:
If the prediction category of the target candidate text data is the same as the category indicated by the classification label of the target candidate text data, determining that the pre-trained classification model has prediction accuracy on the target candidate text data, and determining that the target candidate text data is the target text data.
5. The method of claim 1, wherein any one of the sample text data and the target text data is represented as target sample data; training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model, including:
acquiring sample text statistical features for the target sample data;
invoking the pre-trained classification model to generate sample text features of the target sample data;
invoking the pre-trained classification model to output a classification result for the target sample data based on the sample text statistical features and the sample text features;
generating a prediction deviation of the pre-trained classification model based on the classification result for the target sample data and the classification label carried by the target sample data; and
correcting model parameters of the pre-trained classification model based on the prediction deviation to obtain the trained classification model.
6. The method of claim 5, wherein the sample text statistical features comprise at least one of: part-of-speech frequency distribution features of characters in the target sample data, statistical features of the sub-texts in the target sample data, or statistical features of classification keywords in the target sample data.
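Two of the statistical features enumerated in claim 6 — sub-text statistics and classification-keyword statistics — can be sketched as follows. The keyword list and separator set here are illustrative assumptions, and a real system would add the part-of-speech frequency features from a POS tagger, which is omitted here.

```python
import re

def text_statistics(text, keywords, separators=",.;!?"):
    """Toy statistical features for a text sample: sub-text count and keyword counts.

    keywords and separators are illustrative; part-of-speech frequency
    distribution (also named in claim 6) would require a POS tagger.
    """
    subtexts = [s for s in re.split("[" + re.escape(separators) + "]", text) if s]
    kw_counts = {kw: text.count(kw) for kw in keywords}
    return {"num_subtexts": len(subtexts), "keyword_counts": kw_counts}
```

Per claim 5, features of this kind are fed into the model alongside the model-generated text features when producing the classification result.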
7. The method of claim 1, wherein the trained classification model is used to classify text data by risk; the method further comprises:
acquiring session text data of a target object, and acquiring session text statistical features of the session text data;
invoking the trained classification model to generate session text features of the session text data;
invoking the trained classification model to output a risk classification result for the session text data based on the session text statistical features and the session text features; and
if the risk classification result indicates that the session text data carries a session risk, issuing an early warning for the target object.
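The classify-then-warn flow of claim 7 reduces to a small dispatch step; the sketch below assumes a risk model exposed as a callable returning a category string and a warning hook as a callable — both interfaces are assumptions for illustration, not part of the claim.

```python
def classify_and_warn(session_text, risk_model, warn):
    """Run the risk classifier on session text; trigger an early warning if risky.

    risk_model: callable text -> category string, e.g. "risk" or "safe" (assumed).
    warn: callable invoked with the session text when a session risk is flagged.
    """
    result = risk_model(session_text)
    if result == "risk":
        warn(session_text)  # early warning for the target object
    return result
```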
8. A data processing apparatus, the apparatus comprising:
The acquisition module is used for acquiring sample text data; the number of the sample text data is smaller than a sample index number; the sample text data carries a classification label; the sample text data comprises N sub-texts, wherein N is a positive integer;
The processing module is used for sequentially performing sliding segmentation on the N sub-texts based on a sliding window to obtain a plurality of candidate text data of the sample text data; each candidate text data consists of one or more consecutive sub-texts among the N sub-texts;
the processing module is further used for training the classification model based on the sample text data to obtain a pre-trained classification model;
The processing module is further used for calling the pre-trained classification model to conduct classification prediction on the plurality of candidate text data, and selecting target text data from the plurality of candidate text data according to prediction accuracy of the pre-trained classification model on the plurality of candidate text data;
The processing module is further used for training the pre-trained classification model based on the sample text data and the target text data to obtain a trained classification model; the trained classification model is used for performing classification prediction on text data.
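The sliding segmentation performed by the processing module — every candidate being a contiguous run of sub-texts — amounts to enumerating all windows of every length 1..N over the N sub-texts. A minimal sketch, assuming the sub-texts are rejoined with a simple joiner (the joiner is an assumption; the claim does not specify how sub-texts are recombined):

```python
def sliding_candidates(subtexts, joiner=""):
    """Enumerate every contiguous run of sub-texts as a candidate text.

    Equivalent to sliding a window of each size 1..N over the N sub-texts,
    yielding N*(N+1)/2 candidates in total.
    """
    n = len(subtexts)
    candidates = []
    for length in range(1, n + 1):              # window size
        for start in range(0, n - length + 1):  # window position
            candidates.append(joiner.join(subtexts[start:start + length]))
    return candidates
```

These candidates are then scored by the pre-trained model, and the confidently correct ones become the target text data used for the second round of training.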
9. An electronic device comprising a processor and a memory, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-7.
11. A computer program product comprising computer instructions which, when executed by a processor, implement the method of any of claims 1-7.
CN202210756062.XA 2022-06-29 2022-06-29 Data processing method, device, electronic equipment, medium and program product Active CN115098680B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210756062.XA CN115098680B (en) 2022-06-29 2022-06-29 Data processing method, device, electronic equipment, medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210756062.XA CN115098680B (en) 2022-06-29 2022-06-29 Data processing method, device, electronic equipment, medium and program product

Publications (2)

Publication Number Publication Date
CN115098680A (en) 2022-09-23
CN115098680B (en) 2024-08-09

Family

ID=83294589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210756062.XA Active CN115098680B (en) 2022-06-29 2022-06-29 Data processing method, device, electronic equipment, medium and program product

Country Status (1)

Country Link
CN (1) CN115098680B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960800A (en) * 2019-03-13 2019-07-02 安徽省泰岳祥升软件有限公司 Weakly supervised text classification method and device based on active learning
CN112100378A (en) * 2020-09-15 2020-12-18 中国平安人寿保险股份有限公司 Text classification model training method, device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813932B (en) * 2020-06-17 2023-11-14 北京小米松果电子有限公司 Text data processing method, text data classifying device and readable storage medium
CN113312899B (en) * 2021-06-18 2023-07-04 网易(杭州)网络有限公司 Text classification method and device and electronic equipment
CN113688239B (en) * 2021-08-20 2024-04-16 平安国际智慧城市科技股份有限公司 Text classification method and device under small sample, electronic equipment and storage medium
CN113918720A (en) * 2021-10-29 2022-01-11 平安普惠企业管理有限公司 Training method, device and equipment of text classification model and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960800A (en) * 2019-03-13 2019-07-02 安徽省泰岳祥升软件有限公司 Weakly supervised text classification method and device based on active learning
CN112100378A (en) * 2020-09-15 2020-12-18 中国平安人寿保险股份有限公司 Text classification model training method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN115098680A (en) 2022-09-23

Similar Documents

Publication Publication Date Title
CN110765244B (en) Method, device, computer equipment and storage medium for obtaining answering operation
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
US10827024B1 (en) Realtime bandwidth-based communication for assistant systems
US10831796B2 (en) Tone optimization for digital content
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN111291195B (en) Data processing method, device, terminal and readable storage medium
US10803253B2 (en) Method and device for extracting point of interest from natural language sentences
JP2023535709A (en) Language expression model system, pre-training method, device, device and medium
CN111651996A (en) Abstract generation method and device, electronic equipment and storage medium
US11636272B2 (en) Hybrid natural language understanding
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN112860896A (en) Corpus generalization method and man-machine conversation emotion analysis method for industrial field
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
CN117271736A (en) Question-answer pair generation method and system, electronic equipment and storage medium
CN114756675A (en) Text classification method, related equipment and readable storage medium
CN112308453B (en) Risk identification model training method, user risk identification method and related devices
CN110969005A (en) Method and device for determining similarity between entity corpora
CN110377706B (en) Search sentence mining method and device based on deep learning
CN115952854B (en) Training method of text desensitization model, text desensitization method and application
CN115098680B (en) Data processing method, device, electronic equipment, medium and program product
CN113590768B (en) Training method and device for text relevance model, question answering method and device
CN115292495A (en) Emotion analysis method and device, electronic equipment and storage medium
CN115221298A (en) Question and answer matching method and device, electronic equipment and storage medium
CN115292492A (en) Method, device and equipment for training intention classification model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40074439

Country of ref document: HK

GR01 Patent grant