
CN113434671B - Data processing method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN113434671B
CN113434671B (application CN202110700948.8A)
Authority
CN
China
Prior art keywords
text
similarity
target
sample
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110700948.8A
Other languages
Chinese (zh)
Other versions
CN113434671A (en)
Inventor
陈庆伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202110700948.8A priority Critical patent/CN113434671B/en
Publication of CN113434671A publication Critical patent/CN113434671A/en
Application granted granted Critical
Publication of CN113434671B publication Critical patent/CN113434671B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method, apparatus, computer device, and storage medium. The method includes: obtaining the text similarity of every two texts in a target Batch, where the target Batch includes N samples, each sample includes two texts, and N is a positive integer; constructing a similarity matrix according to the text similarity of every two texts and the label of each of the N samples, and constructing a label matrix according to the label of each of the N samples; and obtaining a loss function value of the target Batch based on the similarity matrix and the label matrix, and adjusting parameters of a target model based on the loss function value. The similarity matrix has the same number of rows and columns as the label matrix and is obtained from texts of the same sequence; one row of the similarity matrix includes the similarity of the two texts contained in one and the same sample; one row of the label matrix includes the label of one sample.

Description

Data processing method, device, computer equipment and storage medium
Technical Field
The embodiment of the invention relates to the field of deep learning, in particular to a data processing method, a data processing device, computer equipment and a storage medium.
Background
Question-answering robots are among the most important applications in the field of natural language processing (NLP). After acquiring the text content of a user's voice input, such a robot retrieves from a question-answer library the standard question with the highest similarity to the user's question, and then returns the answer corresponding to that standard question.
In the related art, text similarity matching algorithms based on deep learning suffer mainly from a large amount of computation, a slow training process, and weak generalization.
Disclosure of Invention
The embodiment of the invention provides a data processing method, apparatus, computer device, and storage medium, which can increase the training speed of a model and improve the generalization performance of the trained model.
In order to solve the above technical problems, an embodiment of the invention adopts the following technical scheme. There is provided a data processing method, the method comprising: obtaining the text similarity of every two texts in a target Batch, where the target Batch includes N samples, each sample includes two texts, and N is a positive integer; constructing a similarity matrix according to the text similarity of every two texts and the label of each of the N samples, and constructing a label matrix according to the label of each of the N samples; and obtaining a loss function value of the target Batch based on the similarity matrix and the label matrix, and adjusting parameters of a target model based on the loss function value. The similarity matrix has the same number of rows and columns as the label matrix and is obtained from texts of the same sequence; one row of the similarity matrix includes the similarity of the two texts contained in one and the same sample; one row of the label matrix includes the label of one sample.
In some manners, the obtaining the text similarity of each two texts in the target Batch includes: acquiring a characterization vector of each text in N samples; and obtaining the text similarity of each two texts based on the characterization vector of each text.
In some manners, before the constructing of the similarity matrix and the label matrix, the method further includes: if the similarity between a first text and a second text is greater than a preset similarity, determining the first text and the second text as a positive sample; if the similarity between a third text and a fourth text is less than or equal to the preset similarity, determining the third text and the fourth text as a negative sample; the first text, the second text, the third text and the fourth text are all texts in the N samples.
In some manners, the constructing of the similarity matrix according to the text similarity of every two texts and the label of each of the N samples includes: constructing a first auxiliary column of the similarity matrix according to the label of each of the N samples; if a first target row includes a positive sample, setting the value of the first auxiliary column in the first target row to 0; if the first target row has no positive sample, setting the value of the first auxiliary column in the first target row to the preset similarity; where the first target row is any row of the similarity matrix.
In some aspects, the constructing of the label matrix from the label of each of the N samples includes: constructing a second auxiliary column of the label matrix according to the label of each of the N samples; if a second target row includes a positive sample, setting the value of the second auxiliary column in the second target row to 0; if the second target row has no positive sample, setting the value of the second auxiliary column in the second target row to 1; where the second target row is any row of the label matrix.
In some aspects, the calculating the loss function value of the target Batch based on the similarity matrix and the tag matrix includes: adjusting the value of each row in the similarity matrix based on a target adjustment coefficient, wherein the sum of the similarity included in each row of the adjusted similarity matrix is close to 1; and acquiring cross entropy of the adjusted similarity matrix and the label matrix, and acquiring a loss function value of the target Batch based on the cross entropy.
In some aspects, the adjusting of the parameters of the target model based on the loss function value includes: in the case that the target Batch is a negative sample and the label of the target Batch is 1, generating an adjustment threshold based on a stochastic gradient descent method, and adjusting the parameters of the target model according to the adjustment threshold so as to adjust the similarity of the two texts in the target Batch.
In order to solve the above technical problem, an embodiment of the present invention further provides a data processing apparatus, including: the acquisition module is used for acquiring the text similarity of every two texts in the target Batch, wherein the target Batch comprises N samples, each sample comprises two texts, and N is a positive integer; the construction module is used for constructing a similarity matrix according to the text similarity of each two texts and the labels of each sample in the N samples, which are acquired by the acquisition module, and constructing a label matrix according to the labels of each sample in the N samples; the adjustment module is used for obtaining a loss function value of the target Batch based on the similarity matrix constructed by the construction module and the label matrix, and adjusting parameters of a target model based on the loss function value; the similarity matrix is the same as the label matrix in number of rows and columns and is obtained based on texts of the same sequence; one row of the similarity matrix comprises two text similarities in the same sample; one row of the tag matrix includes a tag of one sample.
In some manners, the obtaining module is specifically configured to obtain a token vector of each text in the N samples; the obtaining module is specifically further configured to obtain a text similarity of each two texts based on the representation vector of each text.
In some aspects, the apparatus further includes a determining module; the determining module is configured to determine the first text and the second text as a positive sample if the similarity between the first text and the second text is greater than a preset similarity; the determining module is further configured to determine the third text and the fourth text as a negative sample if the similarity between the third text and the fourth text is less than or equal to the preset similarity; the first text, the second text, the third text and the fourth text are all texts in the N samples.
In some manners, the construction module is specifically configured to construct a first auxiliary column of the similarity matrix according to the label of each of the N samples; the construction module is further configured to set the value of the first auxiliary column in the first target row to 0 if the first target row includes a positive sample; the construction module is further configured to set the value of the first auxiliary column in the first target row to the preset similarity if the first target row has no positive sample; where the first target row is any row of the similarity matrix.
In some manners, the construction module is specifically configured to construct a second auxiliary column of the label matrix according to the label of each of the N samples; the construction module is further configured to set the value of the second auxiliary column in the second target row to 0 if the second target row includes a positive sample; the construction module is further configured to set the value of the second auxiliary column in the second target row to 1 if the second target row has no positive sample; where the second target row is any row of the label matrix.
In some manners, the adjusting module is specifically configured to adjust a value of each row in the similarity matrix based on a target adjustment coefficient, where a sum of similarities included in each row of the adjusted similarity matrix is close to 1; the adjusting module is specifically further configured to obtain a cross entropy of the adjusted similarity matrix and the tag matrix, and obtain a loss function value of the target Batch based on the cross entropy.
In some manners, the adjustment module is specifically configured to generate an adjustment threshold based on a stochastic gradient descent method when the target Batch is a negative sample and the label of the target Batch is 1, and to adjust the parameters of the target model according to the adjustment threshold so as to adjust the similarity of the two texts in the target Batch.
In order to solve the above technical problem, an embodiment of the present invention further provides a computer device, which includes a memory and a processor, where the memory stores computer readable instructions, and when the computer readable instructions are executed by the processor, the processor is caused to execute the steps of the data processing method.
To solve the above technical problem, embodiments of the present invention further provide a storage medium storing computer readable instructions, where the computer readable instructions when executed by one or more processors cause the one or more processors to perform the steps of the data processing method described above.
The embodiment of the invention has the following beneficial effects: by acquiring the texts of the plurality of samples contained in each Batch of the training sample set and calculating the text similarity between every two of those texts, sample utilization is improved, the model can learn from more negative samples, and the generalization performance of the trained model is improved. Meanwhile, since the samples within each Batch are no longer mutually independent, the training speed and training efficiency of the model are greatly improved.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a data processing method according to an embodiment of the application;
FIG. 2 is a second flow chart of a data processing method according to an embodiment of the application;
FIG. 3 is a third flow chart of a data processing method according to an embodiment of the application;
FIG. 4 is a fourth flow chart of a data processing method according to an embodiment of the application;
FIG. 5 is a fifth flow chart of a data processing method according to an embodiment of the present application;
FIG. 6 is a sixth flow chart of a data processing method according to an embodiment of the present application;
FIG. 7 is a seventh flow chart of a data processing method according to an embodiment of the application;
FIG. 8 is a schematic diagram of a basic structure of a data processing apparatus according to an embodiment of the present application;
fig. 9 is a basic structural block diagram of a computer device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally means that the associated objects are in an "or" relationship.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In the related art, in order to achieve communication between a person and a machine, the machine needs to process voice or text input by the person and then match corresponding answer contents from a question-answer library. In general, the accuracy of text matching can be improved based on a deep learning model.
Aiming at a text similarity matching algorithm in the related technology, the method mainly comprises the following two implementation modes:
Mode 1:
In mode 1, a text a (text_a) and a text b (text_b) may be spliced into one sequence, using a special classification token (CLS) and a separator token (SEP), as the input of a model. The spliced input is then predicted by a binary classification model built from modules such as a convolutional neural network (CNN), a long short-term memory network (LSTM), or the Encoder of a Transformer, and whether the two texts are semantically similar is determined from the result. This mode has high accuracy, but each text in the question-answer library must be matched once, so it is inefficient and is typically used only in the later precise-elimination (fine-ranking) stage.
Mode 2:
In mode 2, a feature extraction model may be trained to first convert the format of the input text. The format-converted fields are then matched against the fields in the matching library by vector retrieval. This mode is efficient and is widely used in the coarse-ranking stage. However, training the feature extraction model usually requires constructing a classification model or a twin (siamese) network with a triplet loss (TripletLoss), and the feature extraction model must then be separated from the trained model in order to extract features of the text content, so the training speed is slow. In addition, because there is little matching between samples during training, each sample is independent, and the generalization of the trained model is not strong.
Taking the training process of the twin network with triplet loss as an example, suppose the similarity of the positive pair text_a_i and text_b_i is p_true (ranging from 0 to 1) and the similarity of the negative pair text_a_i and text_c_i is p_false (ranging from 0 to 1); the loss function value of the sample is then based on the difference between p_true and p_false, so that training drives p_true up and p_false down. For the Batch in which the sample is located, the loss function value of the Batch is the average of the loss function values of the individual samples. Compared with the training mode of mode 1 this is an improvement, but all samples within each Batch are still independent of one another, the training speed is slow, and the generalization performance of the trained model is poor.
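The per-sample loss and Batch-level average described above can be sketched as follows. This is a hedged illustration, not the patent's own formula: the hinge form with a `margin` hyperparameter is the conventional triplet-loss formulation, and the function names and the margin value are assumptions.

```python
def triplet_loss(p_true, p_false, margin=0.2):
    # Hinge-style loss over pair similarities: pushes the positive-pair
    # similarity p_true above the negative-pair similarity p_false by at
    # least `margin` (an illustrative hyperparameter, not from the text).
    return max(0.0, p_false - p_true + margin)

def batch_loss(pairs, margin=0.2):
    # The Batch loss is the average of the per-sample loss values,
    # as the text describes.
    return sum(triplet_loss(pt, pf, margin) for pt, pf in pairs) / len(pairs)
```

Note that each pair contributes independently to the average, which is exactly the sample-independence problem this application sets out to fix.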
To address the problems of the above training modes, the present application provides a data processing method that can improve sample utilization and increase the training speed while the number of samples remains unchanged. In addition, since all samples participate in similarity matching with one another, the generalization performance of the trained model is stronger.
As shown in fig. 1, a flow chart of a data processing method provided in this embodiment includes S201 to S203:
S201, obtaining the text similarity of every two texts in the target Batch.
The target Batch comprises N samples, each sample comprises two texts, and N is a positive integer.
Illustratively, the target Batch is one Batch in a training sample set, where the training sample set includes a plurality of batches, and each Batch includes a plurality of samples. The Batch in the training sample set is used for defining the number of samples to be processed before the parameters of the model are updated, that is, the parameters of the model are updated after the samples contained in one Batch are processed. Typically, one Batch may include 32 samples, 64 samples, 128 samples, or the like, based on the computing power of the electronic device.
Illustratively, after the target Batch is obtained, the text contained in each sample in the target Batch is extracted, and the text similarity between every two texts is calculated.
It will be appreciated that, where the target Batch includes N samples and each sample includes two texts, the target Batch includes 2N texts.
In one possible implementation, the text similarity for each two texts may be calculated by the following steps.
Illustratively, as shown in fig. 2, the above step 201 may include the following S201a1 and S201a2:
S201a1, obtaining a characterization vector of each text in the N samples.
For example, a feature extraction model may be used to extract a token vector for each text in the N samples through a particular transformation relationship.
Note that the characterization vector may be represented as an embedding, which may be a vector of length 768. That is, the feature extraction model converts the text content, through a specific conversion relation, into a vector of 768 numbers. This embedding is the characterization vector mentioned above.
And S201a2, obtaining the text similarity of every two texts based on the characterization vector of each text.
For example, after the token vector of each text is obtained, the text similarity between every two texts can be obtained through calculation according to the token vector of each text.
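Steps S201a1 and S201a2 can be sketched minimally as follows, assuming the characterization vectors are plain Python lists and that cosine similarity is the similarity measure; the patent text does not fix a specific measure, so that choice is an assumption.

```python
import math

def cosine_similarity(u, v):
    # Text similarity computed from two characterization (embedding) vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_similarity(vectors_a, vectors_b):
    # Similarity of every text Ai against every text Bj in the Batch,
    # giving the raw values from which the similarity matrix is built.
    return [[cosine_similarity(u, v) for v in vectors_b] for u in vectors_a]
```

In practice the vectors would come from the feature extraction model (e.g. 768-dimensional embeddings), but the pairwise structure is the same.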
S202, constructing a similarity matrix according to the text similarity of every two texts and the labels of each sample in the N samples, and constructing a label matrix according to the labels of each sample in the N samples.
Illustratively, after obtaining the text similarity of every two texts among the 2N texts, a similarity matrix is constructed based on the similarities and the label of each sample. A label matrix is also constructed based on the label of each sample.
Wherein the similarity matrix is the same as the label matrix in number of rows and columns and is obtained based on the text of the same sequence. One row of the similarity matrix includes two text similarities in the same sample. One row of the tag matrix described above includes a tag of one sample.
Specifically, the similarity matrix and the label matrix may be respectively constructed based on the calculation results of the text in each sample and the text in other samples in the target Batch. Different matrices may be constructed based on the calculation of different parameters for the two texts.
For example, suppose the target Batch contains 5 samples, each represented as (Ai, Bi); the similarity between each Ai (i ranging from 1 to 5) and each Bj (j ranging from 1 to 5) is calculated, and the corresponding matrix is constructed based on the results.
The object model may be a classification model, for example.
For example, for the process of constructing the similarity matrix and the tag matrix, construction needs to be performed based on texts of the same sequence, and for each line of the similarity matrix, similarity of two texts contained in one and the same sample is included. For each row of the tag matrix, a tag of one sample is included.
Illustratively, taking the example that the target Batch includes 5 samples (Ai, Bi), the similarity matrix constructed in step S202 may be as shown in Table 1 below:
TABLE 1
Illustratively, table 1 above is a text similarity construct between Ai and Bi (i, j take values 1 to 5) in 5 samples, respectively.
Illustratively, the tag matrix constructed in step 202 above may be as shown in Table 2 below:
     B1  B2  B3  B4  B5
A1    1   0   0   0   0
A2    0   1   0   0   0
A3    0   0   0   0   0
A4    0   0   0   1   0
A5    0   0   0   0   1
TABLE 2
Illustratively, as shown in Table 2 above, among the 5 samples the label of sample 3 (A3, B3) is 0, indicating that this sample is composed of two semantically dissimilar texts. Each of the other 4 samples is composed of two semantically similar texts.
Illustratively, based on the above tables 1 and 2, the similarity matrix and the tag matrix are composed of the same sequence of texts, i.e., the values of both matrices are calculated from the same sequence of texts.
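The label matrix of Table 2 can be sketched as follows; this is a minimal illustration under the assumption that sample i is the pair (Ai, Bi), with label 1 for a positive pair and 0 for a negative one.

```python
def build_label_matrix(labels):
    # Row i is one-hot at column i when sample i is a positive pair, and
    # all zeros when sample i is a negative pair (cf. row A3 in Table 2).
    n = len(labels)
    return [[labels[i] if i == j else 0 for j in range(n)]
            for i in range(n)]
```

With labels [1, 1, 0, 1, 1] this reproduces Table 2, where the row of negative sample 3 is all zeros.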
S203, obtaining a loss function value of the target Batch based on the similarity matrix and the label matrix, and adjusting parameters of a target model based on the loss function value.
For example, after the similarity matrix and the label matrix are acquired, the loss function value of the target Batch may be calculated from the two matrices. Moreover, since a Batch defines the number of samples to be processed before the internal model parameters are updated, once the loss function value of the target Batch is acquired, the parameters of the target model can be adjusted based on that value.
It will be appreciated that since the function value of the loss function may be used to represent the degree of error between the predicted value and the actual value during model training, the smaller the loss function value, the better the model fit, and thus the model may adjust the parameters of the model based on the loss function value.
In this way, by acquiring the texts of the plurality of samples contained in each Batch of the training sample set and calculating the text similarity between every two of those texts, sample utilization is improved, the model can learn from more negative samples, and the generalization performance of the trained model is improved. Meanwhile, since the samples within each Batch are no longer mutually independent, the training speed and training efficiency of the model are greatly improved.
In some ways, since the texts in different samples are subjected to text similarity matching in the step S202, before the similarity matrix and the label matrix are constructed, the labels of the new samples composed of every two texts need to be determined.
Illustratively, before the step S203, as shown in fig. 3, the data processing method provided in the embodiment of the present application may further include the following steps S202b1 and S202b2:
S202b1, if the similarity between the first text and the second text is greater than a preset similarity, determining the first text and the second text as positive samples.
S202b2, if the similarity between the third text and the fourth text is less than or equal to the preset similarity, determining the third text and the fourth text as a negative sample.
The first text, the second text, the third text and the fourth text are all texts in the N samples.
For example, the preset similarity may be a hyperparameter of the target model, that is, it is set before the model starts training.
Illustratively, for the positive samples, their values in the tag matrix may be 1, and for the negative samples, their values in the tag matrix may be 0.
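Steps S202b1 and S202b2 amount to thresholding by the preset similarity. A sketch follows; the 0.5 default is an illustrative hyperparameter, not a value mandated by the text.

```python
def label_pair(similarity, preset_similarity=0.5):
    # Positive sample (label 1) when the similarity exceeds the preset
    # similarity; negative sample (label 0) otherwise.
    return 1 if similarity > preset_similarity else 0
```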
If any row of the label matrix contains, in addition to its own sample, other samples with a label value of 1, the prediction error for that sample is large, and the sample can be processed accordingly. The processing may include deleting the row in which the sample is located, ignoring the sample, and so on.
Here, positive samples are samples belonging to a certain class, and negative samples are samples not belonging to that class. For example, in image recognition of the letter A, a sample similar to the letter A is a positive sample, and vice versa.
In the related art, text similarity calculation is generally performed on only two texts included in a single sample in Batch, resulting in low negative sample utilization. For example, for text in a certain sample, although a plurality of texts different from the text are included in other samples, they do not participate in the training process. In the embodiment of the application, for a plurality of samples in the same Batch, the text similarity among the samples is calculated respectively and participates in the training process, so that the utilization rate of the negative sample is greatly improved, and the generalization performance of the trained model is enhanced.
In some ways, in order to reduce the influence of the sample expansion on the SoftMax activation function in the process of obtaining the loss function value, auxiliary columns need to be added in the similarity matrix and the label matrix.
Illustratively, the step S202 may include steps S202c1 to S202c3 shown in fig. 4, and steps 202d1 to 202d3 shown in fig. 5:
S202c1, constructing a first auxiliary column of the similarity matrix according to the label of each sample in the N samples.
S202c2, if the first target row includes a positive sample, setting a value of the first auxiliary column in the first target row to 0.
S202c3, if the first target row does not have a positive sample, setting the value of the first auxiliary column in the first target row as the preset similarity.
Wherein the first target row is any row of the similarity matrix.
Illustratively, taking the preset similarity of 0.5 as an example, the similarity matrix after adding the auxiliary column is shown in Table 3 below:
     B1   B2   B3   B4   B5   Auxiliary column
A1   0.9  0.1  0.2  0.4  0.3  0
A2   0.2  0.7  0.1  0.4  0.3  0
A3   0.1  0.2  0.4  0.3  0.1  0.5
A4   0.3  0.3  0.3  0.9  0.2  0
A5   0.2  0.2  0.2  0.1  0.8  0
TABLE 3
S202d1, constructing a second auxiliary column of the label matrix according to the labels of each sample in the N samples.
S202d2, if the second target row includes a positive sample, setting a value of the second auxiliary column in the second target row to 0.
And S202d3, if the second target row has no positive sample, setting the value of the second auxiliary column in the second target row to be 1.
Wherein the second target row is any row of the label matrix.
Illustratively, the label matrix with the auxiliary column added, obtained by applying the above rules to the samples of Table 3, may be as shown in the following Table 4:
B1 B2 B3 B4 B5 Auxiliary column
A1 1 0 0 0 0 0
A2 0 1 0 0 0 0
A3 0 0 0 0 0 1
A4 0 0 0 1 0 0
A5 0 0 0 0 1 0
TABLE 4
It should be noted that a SoftMax activation function needs to be used in the process of calculating the loss function value of the target Batch. The activation function normalizes each row of the matrix so that its values sum to 1, mapping larger values within a row to larger probabilities and smaller values to smaller probabilities. Therefore, if a certain row of the matrix contains two groups of similar texts, or if no similar text exists in a certain row, the calculation result is strongly affected.
Accordingly, for the similarity matrix, if a positive sample exists in a certain row of the matrix, the value of that row's auxiliary column is 0; otherwise, if no positive sample exists in the row, the value of the auxiliary column of the row is the preset similarity. For the label matrix, if a positive sample exists in a certain row of the matrix, the value of that row's auxiliary column is 0; otherwise, if no positive sample exists in the row, the value of the auxiliary column of the row is 1.
In one implementation, for the label matrix, only the label of each of the N samples may be considered; that is, in the label matrix, the value corresponding to each of the N samples is its label value, and the values at the other positions are 0.
Therefore, by setting auxiliary columns for the similarity matrix and the label matrix, the loss function value of the target Batch is more accurate and better reflects the error between the predicted result and the real result, so that the target model can be adjusted more accurately.
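Steps S202d1 to S202d3 for the label matrix can be sketched in the same style; the concrete label values below are an assumption derived from Table 3 and the preset similarity of 0.5 (every sample except A3 contains a positive pair), and the function name is illustrative.

```python
import numpy as np

# Label matrix for the N samples; a 1 marks a labelled positive pair.
p_true = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],   # A3: no positive pair (0.4 <= 0.5)
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
])

def add_label_auxiliary_column(labels):
    """Steps S202d1-S202d3: the auxiliary value is 0 if the row contains
    a positive sample and 1 otherwise, so every row sums to exactly 1."""
    has_positive = labels.any(axis=1)
    aux = np.where(has_positive, 0, 1)
    return np.hstack([labels, aux[:, None]])

p_true_aux = add_label_auxiliary_column(p_true)
# Each row now sums to 1, matching the SoftMax output it is compared with.
```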
In some implementations, the loss function value of the target Batch can be obtained by calculating the cross entropy of the similarity matrix and the label matrix.
Illustratively, as shown in fig. 6, the above step S203 may include the following steps S203a1 and S203a2:
S203a1, adjusting the value of each row in the similarity matrix based on a target adjustment coefficient, wherein the sum of the similarities included in each row of the adjusted similarity matrix is close to 1.
S203a2, acquiring cross entropy of the adjusted similarity matrix and the label matrix, and obtaining a loss function value of the target Batch based on the cross entropy.
Illustratively, the loss function value for the target Batch may be derived based on the following equation:
LOSS = Σ CrossEntropy(SoftMax(a × p_pred), p_true)    (Formula 1)
wherein LOSS is the loss function value of the target Batch, CrossEntropy is the cross entropy of the similarity matrix and the label matrix, SoftMax is the activation function, a is the target adjustment coefficient, p_pred is the similarity matrix, and p_true is the label matrix.
Illustratively, the SoftMax activation function described above requires the separate computation of the value of each row in the similarity matrix, i.e., p_pred (i), where i represents the ith row of the similarity matrix.
Since each row of input to the SoftMax activation function should sum approximately to 1, while the prediction accuracy of the values in each row of the similarity matrix may make the sum of a row deviate substantially from 1, the values of each row need to be adjusted by the target adjustment coefficient. Moreover, the adjustment coefficients corresponding to different rows are not necessarily the same.
Therefore, the influence on the SoftMax activation function is reduced by introducing the target adjustment coefficient to adjust the similarity matrix, so that the accuracy of obtaining the loss function value of the target Batch based on the cross entropy of the adjusted similarity matrix and the label matrix is higher.
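Formula 1 can be sketched as follows. A single scalar adjustment coefficient a is assumed for simplicity (the embodiment permits per-row coefficients), and the matrices are the Table 3 similarity matrix and its label matrix with auxiliary columns included.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def batch_loss(p_pred, p_true, a=1.0):
    """Formula 1: LOSS = sum over rows of
    CrossEntropy(SoftMax(a * p_pred), p_true)."""
    probs = softmax(a * p_pred)
    return float(-(p_true * np.log(probs + 1e-12)).sum())

# Table 3 similarity matrix and label matrix, auxiliary columns included.
p_pred = np.array([
    [0.9, 0.1, 0.2, 0.4, 0.3, 0.0],
    [0.2, 0.7, 0.1, 0.4, 0.3, 0.0],
    [0.1, 0.2, 0.4, 0.3, 0.1, 0.5],
    [0.3, 0.3, 0.3, 0.9, 0.2, 0.0],
    [0.2, 0.2, 0.2, 0.1, 0.8, 0.0],
])
p_true = np.array([
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
])
loss = batch_loss(p_pred, p_true, a=2.0)
```

Here every labelled entry is also the largest value of its row, so increasing the adjustment coefficient a sharpens the SoftMax distribution and lowers the cross-entropy loss.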
In some implementations, after the loss function value of the target Batch is obtained, the parameters of the target model may be adjusted based on that loss function value.
For example, the specific adjustment manner is as follows, and the step of adjusting the parameter of the target model based on the loss function value in the step S203 may include the following step S203b as shown in fig. 7:
S203b, under the condition that the target Batch is a negative sample and the label of the target Batch is 1, generating an adjustment threshold based on a stochastic gradient descent method, and adjusting the parameters of the target model according to the adjustment threshold so as to adjust the similarity of the two texts in the target Batch.
If the predicted similarity of the target Batch is smaller than the preset similarity while the label of the target Batch is 1, the prediction for the target Batch deviates substantially from the truth. In this case, adjusting the parameters of the target model brings the model's predicted similarity for the target Batch closer to the label value, i.e., closer to the real situation. Specifically, the parameters of the target model can be adjusted by a stochastic-gradient-descent-based method, increasing the predicted similarity of the two texts in the target Batch so that it approaches the label value. The adjustment logic may also be varied based on the common knowledge of a person skilled in the art.
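The effect of such a stochastic-gradient-descent update can be sketched on a toy scalar parameter; the squared-error objective and all numeric values below are hypothetical stand-ins for the model's actual parameters and loss, chosen only to show the prediction being pulled toward the label.

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    """One stochastic-gradient-descent update of the parameters."""
    return w - lr * grad

# Hypothetical scenario: a pair labelled 1 is currently predicted at
# only 0.3, so repeated gradient steps pull the prediction upward.
label = 1.0
w = np.array([0.3])              # stand-in for the model's parameters
for _ in range(50):
    pred = w[0]                  # stand-in "predicted similarity"
    grad = 2.0 * (pred - label)  # gradient of (pred - label)**2 w.r.t. w
    w = sgd_step(w, np.array([grad]), lr=0.1)
# The predicted similarity now approaches the label value of 1.
```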
According to the data processing method provided by the embodiment, the texts in the plurality of samples contained in each Batch in the training sample set are obtained, the text similarity between every two texts in the texts is calculated, the similarity matrix and the label matrix are constructed based on the similarity and the labels of the samples, and then the loss function value of each Batch is calculated according to the two matrices, so that the utilization of the samples is greatly improved, more negative samples can be learned by the model, and the generalization performance of the trained model is improved. Meanwhile, as each sample of each Batch is not mutually independent, the training speed and the training efficiency of the model are greatly improved. And the auxiliary columns are added to the matrix based on the labels of the samples, so that the calculation result of the loss function value of the Batch is more accurate, the parameters of the target model can be optimized, and the generalization performance of the model after training is improved.
It should be noted that, in the data processing method provided in the embodiment of the present application, the execution body may be a data processing apparatus, or a control module in the data processing apparatus for executing the data processing method. In the embodiment of the present application, a data processing device is described by taking a data processing method performed by the data processing device as an example.
In the embodiment of the present application, the data processing methods shown in the foregoing method drawings are all illustrated by way of example in conjunction with one drawing in the embodiment of the present application. In specific implementation, the data processing method shown in the foregoing method drawings may also be implemented in combination with any other drawing that may be illustrated in the foregoing embodiments, and will not be described herein.
Referring to fig. 8, fig. 8 is a schematic diagram illustrating a basic structure of a data processing apparatus according to the present embodiment.
As shown in fig. 8, a data processing apparatus includes: an obtaining module 801, configured to obtain a text similarity of every two texts in a target Batch, where the target Batch includes N samples, each sample includes two texts, and N is a positive integer; a construction module 802, configured to construct a similarity matrix according to the text similarity of each two texts and the label of each sample in the N samples, which are acquired by the acquisition module 801, and construct a label matrix according to the label of each sample in the N samples; an adjusting module 803, configured to obtain a loss function value of the target Batch based on the similarity matrix and the tag matrix constructed by the constructing module 802, and adjust a parameter of a target model based on the loss function value; the similarity matrix is the same as the label matrix in number of rows and columns and is obtained based on texts of the same sequence; one row of the similarity matrix comprises two text similarities in the same sample; one row of the tag matrix includes a tag of one sample.
In some manners, the obtaining module 801 is specifically configured to obtain a characterization vector of each text in the N samples; the obtaining module 801 is specifically further configured to obtain the text similarity of every two texts based on the characterization vector of each text.
In some aspects, the apparatus further comprises: a determination module 804; the determination module 804 is configured to determine the first text and the second text as a positive sample if the similarity between the first text and the second text is greater than a preset similarity; the determination module 804 is further configured to determine the third text and the fourth text as a negative sample if the similarity between the third text and the fourth text is less than or equal to the preset similarity; wherein the first text, the second text, the third text and the fourth text are all texts in the N samples.
In some manners, the construction module 802 is specifically configured to construct a first auxiliary column of the similarity matrix according to the label of each sample in the N samples; the construction module 802 is specifically further configured to set the value of the first auxiliary column in the first target row to 0 if the first target row includes a positive sample; the construction module 802 is specifically further configured to set the value of the first auxiliary column in the first target row to the preset similarity if the first target row does not have a positive sample; wherein the first target row is any row in the similarity matrix.
In some manners, the construction module 802 is specifically configured to construct a second auxiliary column of the label matrix according to the label of each sample in the N samples; the construction module 802 is specifically further configured to set the value of the second auxiliary column in the second target row to 0 if the second target row includes a positive sample; the construction module 802 is specifically further configured to set the value of the second auxiliary column in the second target row to 1 if the second target row has no positive sample; wherein the second target row is any row in the label matrix.
In some manners, the adjusting module 803 is specifically configured to adjust a value of each row in the similarity matrix based on the target adjustment coefficient, where a sum of similarities included in each row of the adjusted similarity matrix is close to 1; the adjusting module 803 is specifically further configured to obtain cross entropy of the adjusted similarity matrix and the tag matrix, and obtain a loss function value of the target Batch based on the cross entropy.
In some manners, the adjusting module 803 is specifically configured to generate an adjustment threshold based on a stochastic gradient descent method when the target Batch is a negative sample and the label of the target Batch is 1, and to adjust the parameters of the target model according to the adjustment threshold so as to adjust the similarity of the two texts in the target Batch.
The data processing device in the embodiment of the application can be a device, or can be a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), etc., and the non-mobile electronic device may be a server, a network attached storage (Network Attached Storage, NAS), a personal computer (personal computer, PC), a Television (TV), a teller machine, a self-service machine, etc., and the embodiments of the present application are not limited in particular.
The data processing device provided by the embodiment of the application can be used for realizing the method embodiment of fig. 1 to 7. The respective processes implemented by the apparatus are not described here in detail to avoid repetition.
The beneficial effects of the various implementation manners in this embodiment may be specifically referred to the beneficial effects of the corresponding implementation manners in the foregoing method embodiment, and in order to avoid repetition, the description is omitted here.
According to the data processing device provided by the embodiment of the application, the texts in the plurality of samples contained in each Batch in the training sample set are obtained, the text similarity between every two texts in the texts is calculated, the similarity matrix and the label matrix are constructed based on the similarity and the labels of the samples, and then the loss function value of each Batch is calculated according to the two matrices, so that the utilization of the samples is greatly improved, more negative samples can be learned by a model, and the generalization performance of the trained model is improved. Meanwhile, as each sample of each Batch is not mutually independent, the training speed and the training efficiency of the model are greatly improved. And the auxiliary columns are added to the matrix based on the labels of the samples, so that the calculation result of the loss function value of the Batch is more accurate, the parameters of the target model can be optimized, and the generalization performance of the model after training is improved.
In order to solve the technical problems, the embodiment of the invention also provides computer equipment. Referring specifically to fig. 9, fig. 9 is a basic structural block diagram of a computer device according to the present embodiment.
As shown in fig. 9, the internal structure of the computer device is schematically shown. The computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer readable instructions; the database can store a control information sequence, and the computer readable instructions, when executed by the processor, cause the processor to implement a data processing method. The processor of the computer device is used to provide computing and control capabilities, supporting the operation of the entire computer device. The memory of the computer device may store computer readable instructions that, when executed by the processor, cause the processor to perform a data processing method. The network interface of the computer device is used for connecting and communicating with a terminal. It will be appreciated by those skilled in the art that the structure shown in fig. 9 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied; a particular computer device may include more or fewer components than shown, or combine some of the components, or have a different arrangement of components.
The processor in this embodiment is configured to perform specific functions of the acquisition module 801, the construction module 802, and the adjustment module 803 in fig. 8, and the memory stores program codes and various types of data required for executing the above modules. The network interface is used for data transmission between the user terminal or the server. The memory in this embodiment stores program codes and data necessary for executing all the sub-modules in the data processing apparatus, and the server can call the program codes and data of the server to execute the functions of all the sub-modules.
According to the computer equipment provided by the embodiment, the texts in the plurality of samples contained in each Batch in the training sample set are obtained, the text similarity between every two texts in the texts is calculated, the similarity matrix and the label matrix are constructed based on the similarity and the labels of the samples, and then the loss function value of each Batch is calculated according to the two matrices, so that the utilization of the samples is greatly improved, more negative samples can be learned by the model, and the generalization performance of the trained model is improved. Meanwhile, as each sample of each Batch is not mutually independent, the training speed and the training efficiency of the model are greatly improved. And the auxiliary columns are added to the matrix based on the labels of the samples, so that the calculation result of the loss function value of the Batch is more accurate, the parameters of the target model can be optimized, and the generalization performance of the model after training is improved.
The invention also provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of any of the data processing methods of the embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored in a computer-readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
Those of skill in the art will appreciate that the various operations, methods, steps in the flow, acts, schemes, and alternatives discussed in the present application may be alternated, altered, combined, or eliminated. Further, other steps, means, or steps in a process having various operations, methods, or procedures discussed herein may be alternated, altered, rearranged, disassembled, combined, or eliminated. Further, steps, measures, schemes in the prior art with various operations, methods, flows disclosed in the present application may also be alternated, altered, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims (8)

1. A method of data processing, comprising:
obtaining the text similarity of every two texts in a target Batch, wherein the target Batch comprises N samples, each sample comprises two texts, and N is a positive integer;
constructing a similarity matrix according to the text similarity of each two texts and the labels of each sample in the N samples, and constructing a label matrix according to the labels of each sample in the N samples;
Obtaining a loss function value of the target Batch based on the similarity matrix and the label matrix, and adjusting parameters of a target model based on the loss function value;
The similarity matrix is the same as the label matrix in number of rows and columns and is obtained based on texts of the same sequence; one row of the similarity matrix comprises two text similarities contained in one and the same sample; a row of the tag matrix includes a tag of one sample;
Before the similarity matrix is constructed according to the text similarity construction of every two texts and the label construction of each sample in the N samples, and the label matrix is constructed according to the label construction of each sample in the N samples, the method further includes:
if the similarity between the first text and the second text is greater than the preset similarity, determining the first text and the second text as positive samples;
if the similarity of the third text and the fourth text is smaller than or equal to the preset similarity, determining the third text and the fourth text as a negative sample;
wherein the first text, the second text, the third text, and the fourth text are all text in the N samples;
the constructing a similarity matrix according to the text similarity construction of every two texts and the label of each sample in the N samples comprises the following steps:
constructing a first auxiliary column of the similarity matrix according to the label of each sample in the N samples;
if the first target row comprises a positive sample, setting the value of the first auxiliary column in the first target row to 0;
If the first target row does not have a positive sample, setting the value of the first auxiliary column in the first target row as the preset similarity;
Wherein the first target row is any row in the similarity matrix.
2. The method of claim 1, wherein the obtaining the text similarity for each two texts in the target Batch comprises:
Acquiring a characterization vector of each text in N samples;
and obtaining the text similarity of each two texts based on the characterization vector of each text.
3. The method of claim 1, wherein constructing a tag matrix from the tags of each of the N samples comprises:
Constructing a second auxiliary column of the label matrix according to the label of each sample in the N samples;
if the second target row comprises a positive sample, setting the value of the second auxiliary column in the second target row to 0;
If the second target row does not have a positive sample, setting the value of the second auxiliary column in the second target row to be 1;
Wherein the second target row is any row in the label matrix.
4. The method of claim 3, wherein the calculating the loss function value for the target Batch based on the similarity matrix and the tag matrix comprises:
Adjusting the value of each row in the similarity matrix based on a target adjustment coefficient, wherein the sum of the similarity included in each row of the adjusted similarity matrix is close to 1;
And acquiring cross entropy of the adjusted similarity matrix and the label matrix, and acquiring a loss function value of the target Batch based on the cross entropy.
5. The method according to any one of claims 1, 3 to 4, wherein said adjusting parameters of a target model based on said loss function value comprises:
under the condition that the target Batch is a negative sample and the label of the target Batch is 1, generating an adjustment threshold based on a stochastic gradient descent method, and adjusting parameters of the target model according to the adjustment threshold so as to adjust the similarity of the two texts in the target Batch.
6. A data processing apparatus, comprising:
The acquisition module is used for acquiring the text similarity of every two texts in the target Batch, wherein the target Batch comprises N samples, each sample comprises two texts, and N is a positive integer;
The construction module is used for constructing a similarity matrix according to the text similarity of each two texts and the labels of each sample in the N samples, which are acquired by the acquisition module, and constructing a label matrix according to the labels of each sample in the N samples;
The adjustment module is used for obtaining a loss function value of the target Batch based on the similarity matrix constructed by the construction module and the label matrix, and adjusting parameters of a target model based on the loss function value;
The similarity matrix is the same as the label matrix in number of rows and columns and is obtained based on texts of the same sequence; one row of the similarity matrix comprises two text similarities in the same sample; a row of the tag matrix includes a tag of one sample;
the apparatus further comprises: a determining module; the determining module is configured to determine the first text and the second text as a positive sample if the similarity between the first text and the second text is greater than a preset similarity; the determining module is further configured to determine the third text and the fourth text as a negative sample if the similarity between the third text and the fourth text is less than or equal to the preset similarity; wherein the first text, the second text, the third text, and the fourth text are all texts in the N samples;
The construction module is specifically configured to construct a first auxiliary column of the similarity matrix according to the label of each sample in the N samples; if the first target row comprises a positive sample, setting the value of the first auxiliary column in the first target row to 0; if the first target row does not have a positive sample, setting the value of the first auxiliary column in the first target row as the preset similarity; wherein the first target acts on any row in the similarity matrix.
7. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the data processing method of any of claims 1 to 5.
8. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the data processing method of any of claims 1 to 5.
CN202110700948.8A 2021-06-23 2021-06-23 Data processing method, device, computer equipment and storage medium Active CN113434671B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110700948.8A CN113434671B (en) 2021-06-23 2021-06-23 Data processing method, device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113434671A CN113434671A (en) 2021-09-24
CN113434671B true CN113434671B (en) 2024-06-07

Family

ID=77753722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110700948.8A Active CN113434671B (en) 2021-06-23 2021-06-23 Data processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113434671B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107729513A (en) * 2017-10-25 2018-02-23 鲁东大学 Discrete supervision cross-module state Hash search method based on semanteme alignment
CN110059198A (en) * 2019-04-08 2019-07-26 浙江大学 A kind of discrete Hash search method across modal data kept based on similitude
WO2019153551A1 (en) * 2018-02-12 2019-08-15 平安科技(深圳)有限公司 Article classification method and apparatus, computer device and storage medium
CN111160398A (en) * 2019-12-06 2020-05-15 重庆邮电大学 A missing-label multi-label classification method based on instance-level and label-level associations
CN111797700A (en) * 2020-06-10 2020-10-20 南昌大学 A vehicle re-identification method based on fine-grained discriminant network and second-order reordering
CN111914908A (en) * 2020-07-14 2020-11-10 浙江大华技术股份有限公司 Image recognition model training method, image recognition method and related equipment
CN112199520A (en) * 2020-09-19 2021-01-08 复旦大学 Cross-modal Hash retrieval algorithm based on fine-grained similarity matrix

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11288297B2 (en) * 2017-11-29 2022-03-29 Oracle International Corporation Explicit semantic analysis-based large-scale classification


Also Published As

Publication number Publication date
CN113434671A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN112115267B (en) Training method, device, equipment and storage medium of text classification model
US11544474B2 (en) Generation of text from structured data
US20230351149A1 (en) Contrastive captioning neural networks
CN111222305B (en) Information structuring method and device
CN110765785B (en) Chinese-English translation method based on neural network and related equipment thereof
CN115146068B (en) Extraction method, device, equipment and storage medium for relational triples
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
CN112101042B (en) Text emotion recognition method, device, terminal equipment and storage medium
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CN111583911B (en) Speech recognition method, device, terminal and medium based on label smoothing
US11615247B1 (en) Labeling method and apparatus for named entity recognition of legal instrument
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN113157900A (en) Intention recognition method and device, computer equipment and storage medium
CN114218945A (en) Entity identification method, device, server and storage medium
CN110019795A (en) The training method and system of sensitive word detection model
CN114861635B (en) Chinese spelling error correction method, device, equipment and storage medium
US12499326B2 (en) Multi-model joint denoising training
CN112287669B (en) Text processing method and device, computer equipment and storage medium
CN115641480A (en) A Noise Dataset Training Method Based on Sample Screening and Label Correction
CN113434671B (en) Data processing method, device, computer equipment and storage medium
CN119167092A (en) Debiasing method, system, storage medium and terminal for pre-trained language model based on LLM
CN114757189B (en) Event extraction method and device, intelligent terminal and storage medium
CN118585779A (en) A robustness evaluation method for deep learning models with soft label output based on ORS
CN114036260B (en) Sensitive word determination method, device, equipment, storage medium and program product
CN110928987A (en) Legal provision retrieval method based on neural network hybrid model and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant