
CN113761124B - Training method of text coding model, information retrieval method and equipment - Google Patents


Info

Publication number
CN113761124B
CN113761124B
Authority
CN
China
Prior art keywords
text
sample
model
information
function
Legal status
Active
Application number
CN202110572323.8A
Other languages
Chinese (zh)
Other versions
CN113761124A (en)
Inventor
欧子菁
赵瑞辉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110572323.8A priority Critical patent/CN113761124B/en
Publication of CN113761124A publication Critical patent/CN113761124A/en
Application granted granted Critical
Publication of CN113761124B publication Critical patent/CN113761124B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a training method for a text coding model, an information retrieval method, and a device, belonging to the technical field of machine learning. The training method comprises: inputting sample texts in a text relationship network into a text coding model to obtain sample feature vectors corresponding to the sample texts; determining a model loss based on the sample feature vectors and an objective function; and performing iterative training on the text coding model based on the model loss. The retrieval method comprises: in response to a text retrieval operation, acquiring retrieval information based on the text retrieval operation; inputting the retrieval information into the text coding model to obtain a retrieval information feature vector corresponding to the retrieval information; determining a target text from a text library based on the retrieval information feature vector; and displaying the target text through a search result display interface. Modeling is performed based on the network relations of the sample texts, and even when the edges of the text relationship network are sparse and noisy, the model can obtain more accurate vector representations by capturing the semantic information of the texts.

Description

Training method of text coding model, information retrieval method and equipment
Technical Field
The embodiment of the application relates to the technical field of machine learning, in particular to a training method of a text coding model, an information retrieval method and equipment.
Background
Information retrieval is a frequent operation in daily life, covering, for example, paper retrieval, news retrieval, and medical consultation retrieval. The user inputs keywords or key sentences in a search box, the terminal searches the document library for content related to those keywords or key sentences according to the document search rules, and the search results are displayed for the user to view.
The related art generally encodes the text content input by the user into a continuous vector, uses a model to calculate the similarity between this vector and the vector representations of the documents in the document library, and determines the search result based on the vector distance. For the training process of the model, the related art builds the model by contrastive learning, using positive and negative samples to maximize a likelihood function.
However, such contrastive learning focuses more on the relations between the nodes in the text network; when the edges of the text network are sparse and the edge noise is large, the model performance degrades. Moreover, the method requires the vector inner product between positive samples to be as large as possible and the vector inner product between negative samples to be as small as possible, so if suitable negative samples cannot be selected, model performance suffers greatly.
Disclosure of Invention
The embodiment of the application provides a training method, an information retrieval method and equipment for a text coding model, which can improve the text coding performance of the text coding model and improve the information retrieval accuracy. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a training method for a text coding model, where the method includes:
Inputting sample texts in a text relation network into a text coding model to obtain sample feature vectors corresponding to each sample text, wherein the text relation network is an undirected graph taking the sample texts as nodes and connecting lines among neighbor nodes as edges, and the neighbor nodes have the same text attribute;
Determining model loss based on the sample feature vectors and an objective function, wherein the objective function comprises a first function term and a second function term, the first function term is used for representing the representation quality of the sample feature vectors on semantic information in the sample text, and the second function term is used for representing the simulation quality of correlation among the sample feature vectors on the text relation network;
And carrying out iterative training on the text coding model based on the model loss.
In another aspect, an embodiment of the present application provides an information retrieval method, where the method includes:
Responding to a text retrieval operation, and acquiring retrieval information based on the text retrieval operation;
Inputting the search information into a text coding model to obtain a search information feature vector corresponding to the search information, wherein the text coding model is a model which is obtained by training by taking an objective function as a training target and based on a text relation network, the text relation network is an undirected graph which takes texts as nodes and takes relations among the texts as edges, the objective function comprises a first function item and a second function item, the first function item is used for representing the representation quality of the sample feature vector on semantic information in a sample text, and the second function item is used for representing the simulation quality of the correlation among the sample feature vectors on the text relation network;
determining target text from a text library based on the retrieval information feature vector, wherein the target text is text with correlation with the retrieval information;
And displaying the target text through a search result display interface.
In another aspect, an embodiment of the present application provides a training device for a text coding model, where the device includes:
The first input module is used for inputting sample texts in a text relation network into a text coding model to obtain sample feature vectors corresponding to each sample text, wherein the text relation network is an undirected graph taking the sample texts as nodes and taking connecting lines between neighbor nodes as edges, and the neighbor nodes have the same text attribute;
A first determining module, configured to determine a model loss based on the sample feature vectors and an objective function, where the objective function includes a first function term and a second function term, the first function term is used to characterize a representation quality of the sample feature vectors on semantic information in the sample text, and the second function term is used to characterize a simulation quality of correlations between the sample feature vectors on the text relationship network;
And the training module is used for carrying out iterative training on the text coding model based on the model loss.
In another aspect, an embodiment of the present application provides an information retrieval apparatus, including:
the acquisition module is used for responding to the text retrieval operation and acquiring retrieval information based on the text retrieval operation;
The second input module is used for inputting the search information into a text coding model to obtain a search information feature vector corresponding to the search information, the text coding model is a model which is obtained by training by taking an objective function as a training target and based on a text relation network, the text relation network is an undirected graph which takes texts as nodes and takes the relation between the texts as edges, the objective function comprises a first function item and a second function item, the first function item is used for representing the representation quality of the sample feature vector on semantic information in a sample text, and the second function item is used for representing the simulation quality of the correlation between the sample feature vectors on the text relation network;
A second determining module, configured to determine a target text from a text library based on the feature vector of the search information, where the target text is a text having a correlation with the search information;
and the display module is used for displaying the target text through a search result display interface.
In another aspect, the present application provides a computer device comprising a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement a training method of the text encoding model or an information retrieval method as described in the above aspect.
In another aspect, the present application provides a computer readable storage medium having at least one computer program stored therein, the computer program being loaded and executed by a processor to implement a training method for a text encoding model, or an information retrieval method, as described in the above aspects.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to implement the training method of the text encoding model provided in various alternative implementations of the above aspects, or the information retrieval method.
The technical scheme provided by the embodiment of the application at least comprises the following beneficial effects:
In the embodiment of the application, the objective function constrains how well the sample feature vectors represent the sample text content, and also constrains the text coding model so that the correlations between the sample feature vectors conform to the text relationship network. Training the text coding model with this objective function allows modeling based on the network relations of the sample texts while, even when the edges of the text relationship network are sparse and noisy, letting the model obtain more accurate vector representations by capturing the semantic information of the texts, thereby improving the text coding performance of the text coding model. In the model application stage, the text coding model is used to encode the retrieval information, and the target text is queried and displayed based on the obtained retrieval information feature vector, improving information retrieval efficiency and accuracy.
Drawings
FIG. 1 is a schematic diagram of a text relationship network in the related art;
FIG. 2 is a flow chart of a training method for a text encoding model provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a training method for a text encoding model provided by another exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a text relationship network provided by an exemplary embodiment of the present application;
FIG. 5 is a flow chart of a method of information retrieval provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a search result presentation interface provided by an exemplary embodiment of the present application;
FIG. 7 is a flowchart of a retrieval process provided by an exemplary embodiment of the present application;
FIG. 8 is a block diagram of a training device for a text encoding model provided in an exemplary embodiment of the present application;
FIG. 9 is a block diagram of an information retrieval apparatus provided in an exemplary embodiment of the present application;
FIG. 10 is a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
References herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
In the related art, when a computer device receives a text retrieval operation, the text content input by the user is generally encoded into a continuous vector, the similarity between this vector and the vector representation of each document in a document library is calculated by a model, and the retrieval result is determined based on the vector distance. The model is usually trained with a graph structure learning method, which mainly comprises two parts: learning a vector representation of the structure and learning a vector representation of the text, the two parts being concatenated. The vector representation of the structure is intended to capture the topology information of the network: as shown in FIG. 1, nodes A and B are connected while nodes A and C are not, so the distance between the vectors of A and B should be smaller than the distance between the vectors of A and C. For the vector representation of the text, if the text attributes of two nodes are similar, their vector representations should also be similar: as shown in FIG. 1, the text attributes of A and B and of A and C are similar, so the corresponding vector distances should be small as well. In order to capture both kinds of information, the related art adopts a contrastive learning method, which maximizes a likelihood function of the standard form

max ∑_{(v_i, v_j) ∈ E} log p(v_j | v_i), where p(v_j | v_i) = exp(z_i · z_j) / ∑_{v_k ∈ V} exp(z_i · z_k)

However, since the number of network nodes is very large, directly evaluating this equation is expensive, so it is generally approximated by negative sampling:

log σ(z_i · z_j) + ∑_{k=1}^{K} E_{v_k ~ P_n(v)} [log σ(-z_i · z_k)]

The first term of the equation is the positive sampling term and the second the negative sampling term; the related art generally expects the vector inner product between positive samples to be as large as possible and the inner product between negative samples to be as small as possible. However, this approach has two disadvantages: data enhancement matters greatly, so whether suitable negative samples can be selected has a strong influence on model performance; and the method typically suffers a significant degradation when the network edges are sparse, the edge noise is large, or there are unseen nodes.
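For illustration only, the following is a minimal Python (numpy) sketch of this related-art negative-sampling objective; the embedding matrix Z, the edge list, and the uniform noise distribution are assumptions made for the sketch rather than details taken from the related art.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def negative_sampling_objective(Z, edges, num_neg=5, rng=None):
        """Contrastive objective: log sigma(z_i . z_j) for each positive edge
        plus log sigma(-z_i . z_k) over uniformly sampled negative nodes k.
        Z is an (N, d) embedding matrix; edges is a list of (i, j) pairs."""
        rng = rng or np.random.default_rng(0)
        n = Z.shape[0]
        total = 0.0
        for i, j in edges:
            total += np.log(sigmoid(Z[i] @ Z[j]))        # positive sampling term
            for k in rng.integers(0, n, size=num_neg):   # negative sampling term
                total += np.log(sigmoid(-Z[i] @ Z[k]))
        return total  # training maximizes this quantity

    # toy usage with random embeddings and two edges
    Z = np.random.default_rng(1).normal(size=(4, 8))
    print(negative_sampling_objective(Z, [(0, 1), (1, 2)]))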
In order to solve the above technical problems, the application provides a training method for a text coding model and an information retrieval method. The text coding model is trained through an objective function, so modeling can be performed based on the network relations of the sample texts; meanwhile, even when the edges of the text relationship network are sparse and noisy, the model can obtain more accurate vector representations by capturing the semantic information of the texts, thereby improving the text coding performance of the text coding model.
The application scenarios of the training method and the information retrieval method of the text coding model provided by the embodiment of the application are schematically described below.
1. Providing information retrieval services for users
When the training method and the information retrieval method of the text coding model provided by the embodiment of the application are used to provide an information retrieval service for users, the information retrieval method can be implemented as an independent information retrieval program installed in a computer device, or in a background server providing the information retrieval service.
In this scenario, the user inputs the information to be queried (e.g., keywords) into the computer device, which determines the target text using the text coding model based on the retrieval information, or sends the retrieval information to the background server, which determines the target text and returns it to the search result presentation interface.
2. Assisting a user in disease prediction and treatment
When the training method and the information retrieval method of the text coding model provided by the embodiment of the application help a user with disease prediction, the method can be implemented as an independent online diagnosis application or health application, installed in the computer device used by the user or in a background server providing the medical text search service, so that the user can conveniently use the program to make disease inquiries.
In the scene, a user inputs symptoms or disease names in an application program interface, computer equipment inputs contents input by the user such as the symptoms, the disease names and the like into a text coding model to obtain corresponding retrieval information feature vectors, so that a target text is obtained by inquiring from a medical text library based on the retrieval information feature vectors, and the retrieval result is returned to the corresponding application program interface to prompt the user of possible diseases or treatment methods.
Of course, besides being applied to the above-mentioned scenes, the method provided by the embodiment of the present application may also be applied to other scenes that need information retrieval, and the embodiment of the present application is not limited to a specific application scene.
The training method and the information retrieval method of the text coding model provided by the embodiment of the application can be applied to computer equipment such as a terminal or a server. In a possible implementation manner, the information retrieval method provided by the embodiment of the application can be implemented as an application program or a part of the application program and is installed in the terminal, so that the terminal has the function of performing text search according to the retrieval information; the training method of the text coding model provided by the embodiment of the application can be applied to a background server of an application program, so that the server carries out model training and updating. For convenience of description, in the following embodiments, a training method of a text coding model and an information retrieval method are described as examples applied to a computer device, but this is not a limitation.
Fig. 2 shows a flowchart of a training method of a text encoding model according to an exemplary embodiment of the present application. This embodiment will be described by taking the method for a computer device as an example, and the method includes the following steps.
Step 201, inputting sample texts in a text relation network into a text coding model to obtain sample feature vectors corresponding to each sample text, wherein the text relation network is an undirected graph taking the sample texts as nodes and connecting lines among neighboring nodes as edges, and the neighboring nodes have the same text attribute.
The text relationship network is an undirected graph taking the sample texts as nodes and taking connecting lines between neighbor nodes as edges. A graph is composed of a finite non-empty set of vertices and a set of edges between the vertices, generally denoted as G(V, E), where G represents a graph, V is the set of vertices in graph G, and E is the set of edges in graph G; an undirected graph is a graph in which the edges between any two vertices are undirected.
In one possible implementation, before performing model training, the computer device first connects nodes with the same text attribute based on the obtained sample text, making them neighbor nodes, thereby forming a text relationship network. Each node in the text relationship network has its own text attribute, which is extracted by content, for example, for articles in the medical field, the computer device generates the text attribute of the sample text by extracting entities such as symptoms, medicines, disease names, etc. in the articles.
The computer equipment inputs the text relation network into the text coding model to perform model training, so that the model can learn based on text content of sample texts and correlation between the sample texts in the text relation network, and feature vectors generated by the model can represent the text content and topological structure information between the texts.
Step 202, determining model loss based on sample feature vectors and an objective function, wherein the objective function comprises a first function item and a second function item, the first function item is used for representing the representation quality of the sample feature vectors on semantic information in sample texts, and the second function item is used for representing the simulation quality of correlation among the sample feature vectors on a text relation network.
In one possible implementation, the text encoding model is essentially an encoder: it encodes the text input to the model and outputs a feature vector characterizing that text. To further improve the encoding performance of the model so that the feature vector better represents the text content, the computer device calculates the model loss based on the sample feature vectors output by the model and the objective function, and then updates the model parameters in reverse based on the model loss.
In order to realize that the feature vectors output by the model can represent text content and topological structure information among texts, the computer equipment constructs an objective function based on the sample text and a text relation network, so that the generated objective function contains a first function item for representing the representation quality of the sample feature vectors on semantic information in the sample text and a second function item for representing the simulation quality of correlation among the sample feature vectors on the text relation network. The higher the expression quality is, the more accurate and sufficient the expression of the sample feature vector on the semantic information in the sample text is shown, namely the higher the quality of the sample feature vector is; the higher the simulation quality, the higher the matching degree of the correlation between the sample feature vectors and the text correlation in the text relation network, i.e. the higher the accuracy of representing the correlation between texts by the correlation between the sample feature vectors.
And 203, performing iterative training on the text coding model based on the model loss.
The computer device calculates the model loss based on the objective function. When the model loss satisfies the training-end condition, model training finishes; otherwise, the model parameters are adjusted in reverse based on the model loss, and the next round of training is carried out on the model with updated parameters, until the model converges.
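A minimal training-loop sketch of steps 201 to 203 follows, assuming a PyTorch encoder `model` and a loss function `loss_fn` implementing the objective function; both names and their interfaces are hypothetical.

    import torch

    def train(model, loss_fn, X, A, epochs=200, tol=1e-4):
        """Step 203: compute the model loss from the sample feature vectors
        and the objective function, stop when the training-end condition is
        met, otherwise back-propagate to update the model parameters."""
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        prev = float("inf")
        for epoch in range(epochs):
            Z_mu, Z_logvar = model(X)             # sample feature vectors (step 201)
            loss = loss_fn(X, A, Z_mu, Z_logvar)  # model loss from objective (step 202)
            if abs(prev - loss.item()) < tol:     # training-end condition
                break
            opt.zero_grad()
            loss.backward()                       # adjust parameters in reverse
            opt.step()
            prev = loss.item()
        return model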
In summary, in the embodiment of the application, the objective function constrains how well the sample feature vectors represent the sample text content, and also constrains the text coding model so that the correlations between the sample feature vectors conform to the text relationship network. Training the text coding model with this objective function allows modeling based on the network relations of the sample texts while, even when the edges of the text relationship network are sparse and noisy, letting the model obtain more accurate vector representations by capturing the semantic information of the texts, thereby improving the text coding performance of the text coding model.
Fig. 3 is a flowchart illustrating a training method of a text encoding model according to another exemplary embodiment of the present application. This embodiment will be described by taking the method for a computer device as an example, and the method includes the following steps.
In step 301, extracting semantic information from the sample text to obtain text attributes of the sample text.
In order to quickly acquire the correlations among the sample texts and determine the network relations, the computer device extracts semantic information from the sample texts to obtain their text attributes, so that the text relationship network can be built based on the text attributes.
Illustratively, for sample texts in the medical field, the computer device extracts semantic information from each sample text to obtain entities such as symptoms, medicines, and disease names in the article, and generates the text attributes of the sample text. As shown in fig. 4, after the computer device extracts the semantic information of the sample texts, the text attributes of the sample texts are marked; the text attributes of the five sample texts in the figure are each "abdominal pain", "headache", or both.
For other fields of text, the semantic information extracted by the computer device may contain words of the corresponding field, for example, for the meteorological field, the text attribute of the sample text may be "temperature", "humidity", "earthquake", etc., and for the financial field, the text attribute of the sample text may be "fund", "stock", "bond", etc.
Optionally, the computer device directly uses the extracted semantic information as a text attribute of the sample text, or determines a corresponding attribute identifier based on the extracted semantic information, and uses the attribute identifier as the text attribute of the sample text.
Step 302, connecting sample texts with the same text attribute to generate a text relation network.
In one possible implementation, the computer device pre-processes the acquired sample texts to build an implicit text graph structure, for example building the text relationship network based on a K-Nearest Neighbor (KNN) algorithm. The resulting text relationship network is denoted G = (V, E) and contains N nodes (i.e., sample texts) v_i ∈ V and edges (v_i, v_j) ∈ E, where each node has its own text attribute x_i ∈ X, X representing the sample texts.
Illustratively, as shown in FIG. 4, the computer device links pairwise the sample texts containing the same text attribute (all sample texts containing "abdominal pain" are linked pairwise, and likewise all sample texts containing "headache") to construct the text relationship network.
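A minimal Python sketch of this construction follows, assuming each sample text has already been reduced to a set of extracted attribute strings; the concrete attribute assignment is illustrative, not taken from FIG. 4.

    import itertools
    from collections import defaultdict

    def build_text_relationship_network(attrs):
        """attrs: list of attribute sets, one per sample text (node).
        Links every pair of nodes sharing at least one text attribute and
        returns the undirected edge set."""
        by_attr = defaultdict(list)
        for node, attr_set in enumerate(attrs):
            for a in attr_set:
                by_attr[a].append(node)
        edges = set()
        for nodes in by_attr.values():
            for i, j in itertools.combinations(nodes, 2):
                edges.add((min(i, j), max(i, j)))  # undirected edge
        return edges

    # five sample texts with attributes assumed for illustration
    attrs = [{"abdominal pain"}, {"headache"}, {"headache", "abdominal pain"},
             {"abdominal pain"}, {"headache"}]
    print(build_text_relationship_network(attrs))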
In step 303, an adjacency matrix is generated based on the network structure of the text-relational network, the adjacency matrix being a two-dimensional array for characterizing the relationships between nodes in the text-relational network.
For an undirected graph, i.e., the text relationship network in the embodiment of the present application, a developer may define in advance an adjacency matrix A ∈ R^{N×N} without self-loops: A_ij equals 1 if (v_i, v_j) ∈ E, and 0 otherwise. For example, for a text relationship network containing only two sample texts, if there is an edge between the two sample texts, the corresponding adjacency matrix is

A = [[0, 1], [1, 0]]
The adjacency matrix can be normalized as Â = D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix with D_ii = ∑_j A_ij, so the Laplacian matrix of the text relationship network can be expressed as L = I_N - Â.
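The adjacency matrix, its normalization, and the Laplacian can be sketched in numpy as follows, under the normalization stated above.

    import numpy as np

    def adjacency(edges, n):
        """Self-loop-free adjacency matrix: A[i, j] = 1 iff (v_i, v_j) is an edge."""
        A = np.zeros((n, n))
        for i, j in edges:
            A[i, j] = A[j, i] = 1.0
        return A

    def normalized_laplacian(A):
        """A_hat = D^{-1/2} A D^{-1/2} and L = I_N - A_hat, matching the
        normalization described above."""
        d = A.sum(axis=1)
        d_inv_sqrt = np.zeros_like(d)
        mask = d > 0
        d_inv_sqrt[mask] = d[mask] ** -0.5
        A_hat = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
        return A_hat, np.eye(A.shape[0]) - A_hat

    A = adjacency([(0, 1)], 2)  # the two-node example: [[0, 1], [1, 0]]
    A_hat, L_matrix = normalized_laplacian(A)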
And 304, inputting the adjacency matrix into a text coding model to obtain an objective function.
The text coding model in the embodiment of the application is a generative model. The goal of a generative model is to model the joint probability distribution p(X, Z) of the data and to train the model by maximizing the log-likelihood, i.e., max log p(X) = max log ∫ p(X, Z) dZ.
In one possible implementation, step 304 further includes the following steps.
Step 304a, inputting the adjacency matrix into the text coding model to obtain the prior distribution function in the objective function, where the prior distribution function is a Gaussian distribution function taking a target covariance matrix as its variance, and the target covariance matrix is the inverse of the precision matrix corresponding to the adjacency matrix.
To introduce the network structure of the text relationship network (i.e., the relations between sample texts) into the generative model, the computer device adds the connection information between nodes to the prior distribution. Specifically, a developer defines the prior distribution in the objective function as the following Gaussian distribution:

p(Z) = N(Z | 0, Σ)

where the precision matrix Λ = Σ^{-1} is the inverse of the covariance matrix Σ, and Z is the sample feature vector.
First, assuming that the vector representation of each node is 1-dimensional, the covariance matrix Σ is subjected to a Taylor expansion:

Σ = (τ I_N + L)^{-1} = ((1 + τ) I_N - Â)^{-1} = (1 + τ)^{-1} [I_N + Â/(1 + τ) + (Â/(1 + τ))² + ⋯]

Based on this characteristic of the covariance matrix, the correlation between node i and node j is equal to a weighted sum of the multi-order adjacency matrices. When the vector representation of a node is extended to multiple dimensions, the above prior distribution can be rewritten as:

p(Z) = N(Z | 0, ((τ I_N + L) ⊗ I_d)^{-1})

where d represents the dimension of the feature vector, ⊗ is the Kronecker product, and τ is a positive integer used to ensure the stability of the calculated values.
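To make the prior concrete, the following numpy sketch builds the prior covariance from the Laplacian; the Laplacian-plus-τI precision form follows the reconstruction above and should be read as an assumption rather than the application's exact matrix.

    import numpy as np

    def prior_covariance(L_matrix, d, tau=1.0):
        """Covariance of the assumed prior p(Z) = N(0, ((tau*I_N + L) kron I_d)^{-1}).
        tau > 0 keeps the precision matrix positive definite, i.e. the
        computation numerically stable."""
        n = L_matrix.shape[0]
        precision = np.kron(tau * np.eye(n) + L_matrix, np.eye(d))
        return np.linalg.inv(precision)

    # normalized Laplacian of a 3-node complete graph, as a toy example
    L_toy = np.array([[1.0, -0.5, -0.5], [-0.5, 1.0, -0.5], [-0.5, -0.5, 1.0]])
    Sigma = prior_covariance(L_toy, d=2)  # block (i, j) encodes node i-j correlation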
Step 304b, constructing an objective function based on the prior distribution function.
The computer device determines the prior distribution function based on the adjacency matrix input to the model and constructs the objective function based on the prior distribution function. After model training ends, the hidden variables (i.e., the feature vectors of the nodes) of each data point can be obtained by computing the posterior distribution. However, the posterior distribution is difficult to obtain by directly maximizing the likelihood function, so a variational inference method can be adopted for training; that is, the true posterior distribution is approximated by introducing a variational distribution q(Z|X).
In one possible implementation, step 304b further includes the following steps.
And constructing the objective function by taking a target expectation as the first function term and the opposite number of a target relative entropy as the second function term, where the target expectation is the expectation of the joint probability distribution of the sample text and the sample feature vector, the target relative entropy is the relative entropy between a posterior distribution function and the prior distribution function, and the posterior distribution function is the variational distribution of the joint probability distribution between the sample feature vector and the sample text.
Illustratively, the above variational distribution is assumed to be a Gaussian distribution in which each dimension is independent of the others:

q(Z | X) = ∏_{i=1}^{N} q(z_i | x_i), with q(z_i | x_i) = N(z_i | μ_i, diag(σ_i²))

where x_i is the i-th sample text in the text relationship network, z_i is the sample feature vector of the i-th sample text, μ_i is the mean of the corresponding probability distribution, and σ_i² is its variance.
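Sampling from this factorized Gaussian posterior is typically done with the reparameterization trick, as in the following numpy sketch; the (N, d) arrays μ and log σ² are assumed to be produced by the encoder.

    import numpy as np

    def sample_posterior(mu, logvar, rng=None):
        """Draws z ~ q(z_i | x_i) = N(mu_i, diag(sigma_i^2)) via the
        reparameterization z = mu + sigma * eps, eps ~ N(0, I), which keeps
        the sample differentiable w.r.t. mu and sigma in an autograd framework."""
        rng = rng or np.random.default_rng(0)
        eps = rng.normal(size=mu.shape)
        return mu + np.exp(0.5 * logvar) * eps

    Z = sample_posterior(np.zeros((5, 2)), np.zeros((5, 2)))  # toy N=5, d=2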
Thus, the computer device constructs the following objective function:
log p(X) ≥ E_{q(Z∣X)}[log p(X∣Z)] - KL(q(Z∣X) || p(Z))
where p(Z) is the prior distribution. The first function term of the objective, the expectation of log p(X∣Z), constrains the model so that each feature vector can be reconstructed back into the content of its sample text; the second term, the opposite number of the relative entropy between the posterior distribution and the prior distribution, constrains the model so that the correlations between the feature vectors simulate the network structure of the text relationship network.
If the approximate posterior distribution is set to a mutually independent form, i.e.

q(Z | X) = ∏_{i=1}^{N} N(z_i | μ_i, diag(σ_i²)),

the objective function expands to:

E_{q(Z∣X)}[log p(X∣Z)] + E_{q(Z∣X)}[log p(Z)] + H(q(Z∣X))

Using the properties of Gaussian distributions, one can obtain explicit expressions for H(q(Z∣X)) and E_{q(Z∣X)}[log p(Z)], and the final objective function can be expanded to:

E_{q(Z∣X)}[log p(X∣Z)] - (1/2)(μᵀ P μ + tr(P · diag(σ²))) + (1/2) ∑ log σ² + C

where P is shorthand for the prior precision matrix (τ I_N + L) ⊗ I_d, μ and σ² are the stacked means and variances of the variational distribution, and C is a constant independent of the model parameters.
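The closed-form part of this expanded objective can be sketched as follows, again under the assumed precision form P = (τ I_N + L) ⊗ I_d; the reconstruction expectation E[log p(X|Z)] is left as a separate, model-specific term.

    import numpy as np

    def objective_regularizer(mu, logvar, P):
        """Closed-form terms of the expanded objective:
        -1/2 (mu^T P mu + tr(P diag(sigma^2))) + 1/2 sum(log sigma^2),
        with constants dropped. mu and sigma^2 are flattened to length N*d;
        P is the assumed prior precision (tau*I_N + L) kron I_d."""
        m = mu.reshape(-1)
        s2 = np.exp(logvar).reshape(-1)
        cross = m @ P @ m + np.sum(np.diag(P) * s2)  # tr(P diag(s2)) needs only diag(P)
        return -0.5 * cross + 0.5 * np.sum(logvar)

    n, d, tau = 5, 2, 1.0
    P = np.kron(tau * np.eye(n) + np.zeros((n, n)), np.eye(d))  # toy: zero Laplacian
    print(objective_regularizer(np.zeros((n, d)), np.zeros((n, d)), P))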
Optionally, the node relations are introduced into the objective function through this special Gaussian distribution; distributions other than the Gaussian distribution that can represent data correlations may also be used to construct the objective function, which is not limited by the embodiment of the present application.
And 305, inputting the sample texts in the text relation network into a text coding model to obtain sample feature vectors corresponding to the sample texts.
After the objective function is built, the computer device starts model training by inputting all sample texts in the text relationship network into the text encoding model simultaneously.
Optionally, when representing a sample text, the computer device in the embodiment of the application uses a one-hot encoding vector; in addition, term frequency-inverse document frequency (TF-IDF) features of the text, word vectors such as Global Vectors (GloVe), or pre-trained sequence models such as Long Short-Term Memory (LSTM) networks, Recurrent Neural Networks (RNN), and Gated Recurrent Units (GRU) can be used to represent the text.
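A short sketch of two of these optional representations follows: one-hot vectors built by hand and TF-IDF features via scikit-learn, the latter used purely for illustration.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["patient reports abdominal pain", "patient reports severe headache"]

    # one-hot encoding over a whitespace vocabulary
    vocab = sorted({w for t in texts for w in t.split()})
    one_hot = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for w in t.split():
            one_hot[row, vocab.index(w)] = 1.0

    # TF-IDF features as an alternative input representation
    tfidf = TfidfVectorizer().fit_transform(texts)  # sparse (n_texts, n_terms) matrix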
Step 306, determining model loss based on the sample feature vector and the objective function.
Step 307, performing iterative training on the text coding model based on the model loss.
For specific implementations of steps 305 to 307, reference may be made to steps 201 to 203, which are not repeated in the embodiment of the present application.
In the embodiment of the application, the text relationship network is constructed based on the text attributes of the sample texts, and the prior distribution in the objective function is determined based on the adjacency matrix of the text relationship network, so that the objective function constrains model training from the two aspects of text semantics and network structure, improving the text coding performance of the text coding model as well as the model training efficiency.
The application was tested on a public dataset on the link prediction task. Optionally, given two texts, it is judged whether a connecting edge exists between them, with accuracy as the evaluation index; the models in the related art and the text coding model provided by the application were tested, with the following results:
TABLE 1
As can be seen from Table 1, the text coding model provided by the embodiment of the application achieves good performance compared with the methods in the related art, because it considers the semantic information of the nodes while modeling based on the connection relations between nodes.
The various embodiments described above illustrate a training process for a text encoding model that may be applied to information retrieval in one possible implementation. Fig. 5 shows a flowchart of an information retrieval method according to an exemplary embodiment of the present application. This embodiment will be described by taking the method for a computer device as an example, and the method includes the following steps.
In step 501, in response to a text retrieval operation, retrieval information is acquired based on the text retrieval operation.
In one possible implementation, the application has text retrieval functionality. The user inputs the search information to enable the computer equipment to return a search result, wherein the search result is a text corresponding to the search information.
Optionally, a text retrieval area is displayed in a user interface of the application program, and when a content input operation in the text retrieval area is received, the acquired input content is determined as retrieval information; or the application can receive voice information and the user can input the retrieval information by voice.
Illustratively, the user enters "what should be noted for skin allergy" in the text retrieval area and triggers the retrieval control, and the computer device obtains the retrieval information "what should be noted for skin allergy".
Step 502, inputting the search information into a text coding model to obtain a search information feature vector corresponding to the search information.
The text coding model is a model obtained by training based on a text relation network by taking an objective function as a training target, the text relation network is an undirected graph taking texts as nodes and taking relations among the texts as edges, the objective function comprises a first function item and a second function item, the first function item is used for representing the representation quality of sample feature vectors on semantic information in the sample texts, and the second function item is used for representing the simulation quality of correlation among the sample feature vectors on the text relation network. Reference may be made to the embodiments described above for specific model training procedures.
The trained text coding model is an encoder, and the computer equipment inputs the acquired retrieval information into the encoder to obtain the retrieval information feature vector.
In step 503, a target text is determined from the text library based on the feature vector of the search information, wherein the target text is text having a correlation with the search information.
The computer device needs to retrieve text associated with the retrieved information, and in one possible implementation, the application program corresponds to a text library, and each text in the text library corresponds to a text feature vector, and step 503 includes the steps of:
in step 503a, candidate text feature vectors of each candidate text in the text library are obtained, and the candidate text feature vectors are obtained by inputting the candidate text into the text coding model.
In one possible implementation, the computer device inputs candidate texts in the text library into the trained text coding model in advance to obtain candidate text feature vectors of the respective candidate texts.
In step 503b, the similarity between the search information feature vector and each candidate text feature vector is calculated.
The computer device screens out the target text by calculating the similarity between the retrieval information feature vector and each candidate text feature vector, and then locates the target text through the correspondence between candidate text feature vectors and candidate texts.
Optionally, the similarity between feature vectors is characterized by a distance metric such as the cosine distance or the Euclidean distance.
In step 503c, candidate text with similarity greater than the similarity threshold is determined as the target text.
Optionally, a similarity threshold (for example, 80%) is preset in the computer device, and candidate texts with similarity greater than the similarity threshold are determined as target texts; or the computer equipment ranks the candidate texts according to the similarity from high to low, determines the first n candidate texts as target texts, and n is a positive integer.
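Steps 503a to 503c can be sketched in numpy as follows; the function and variable names are illustrative.

    import numpy as np

    def retrieve(query_vec, cand_vecs, threshold=0.8, top_n=None):
        """Cosine similarity between the retrieval-information feature vector
        and every candidate text feature vector, then either a similarity-
        threshold cut or the top-n most similar candidates."""
        q = query_vec / np.linalg.norm(query_vec)
        C = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
        sims = C @ q
        if top_n is not None:
            return np.argsort(sims)[::-1][:top_n]
        return np.flatnonzero(sims > threshold)

    cand_vecs = np.random.default_rng(0).normal(size=(100, 16))
    print(retrieve(cand_vecs[0], cand_vecs, top_n=5))  # toy candidate library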
And 504, displaying the target text through a search result display interface.
As shown in fig. 6, a search area 602 is displayed in the search result presentation interface 601. When the computer device receives a text search operation in the search area 602, it acquires the retrieval information "what should be noted for skin allergy", determines the target text 603 based on the retrieval information, and then displays the target text 603 in the search result presentation interface 601. For the process by which the computer device determines the target text based on the retrieval information, illustratively, as shown in fig. 7, the steps executed by the computer device include: step 701, obtaining the retrieval information "what should be noted for skin allergy"; step 702, outputting the retrieval information feature vector; step 703, calculating the cosine distance between the retrieval information feature vector and the candidate text feature vectors; step 704, determining the target document based on the cosine distance.
In one possible implementation manner, when receiving the text uploading operation, the computer device obtains the text to be stored based on the text uploading operation, inputs the text to be stored into a text coding model to obtain a feature vector corresponding to the text to be stored, and stores the text to be stored and the corresponding feature vector in a text library in an associated manner.
In the embodiment of the application, in the model application stage, the text coding model is used to encode the retrieval information, and the target text is queried and displayed based on the obtained retrieval information feature vector, thereby improving information retrieval efficiency and accuracy.
To demonstrate the applicability of the text coding model provided by the application in medical information retrieval scenarios, developers carried out comparative experiments with medical text data. Schematically, for an article in a given medical field, retrieval was performed with several models in the related art and with the text coding model provided by the embodiment of the application, searching for the 1000 articles most similar to the given article; retrieval performance was tested by calculating the proportion of articles of the same category in the retrieval results, with the following results:
TABLE 2
As can be seen from Table 2, compared with the models in the related art, the embodiment of the present application achieves a better effect in the medical article retrieval scene.
FIG. 8 is a block diagram of a training apparatus for text encoding models provided in an exemplary embodiment of the present application, the apparatus comprising:
A first input module 801, configured to input a sample text in a text relationship network into a text coding model to obtain sample feature vectors corresponding to each sample text, where the text relationship network is an undirected graph with the sample text as a node and a connection line between neighboring nodes as edges, and the neighboring nodes have the same text attribute;
A first determining module 802, configured to determine a model loss based on the sample feature vectors and an objective function, where the objective function includes a first function term and a second function term, the first function term is used to characterize a quality of representation of semantic information in the sample text by the sample feature vectors, and the second function term is used to characterize a quality of simulation of the text relationship network by correlations between the sample feature vectors;
a training module 803, configured to iteratively train the text encoding model based on the model loss.
Optionally, the apparatus further includes:
The information extraction module is used for extracting semantic information from the sample text to obtain the text attribute of the sample text;
the first generation module is used for connecting the sample texts with the same text attribute to generate the text relation network;
the second generation module is used for generating an adjacency matrix based on the network structure of the text relationship network, wherein the adjacency matrix is a two-dimensional array used for representing the relationship between nodes in the text relationship network;
and the second input module is used for inputting the adjacency matrix into the text coding model to obtain the objective function.
Optionally, the text encoding model is a generative model;
The second input module includes:
the input unit is used for inputting the adjacency matrix into the text coding model to obtain the prior distribution function in the objective function, wherein the prior distribution function is a Gaussian distribution function taking a target covariance matrix as its variance, and the target covariance matrix is the inverse of the precision matrix corresponding to the adjacency matrix;
And a function construction unit for constructing the objective function based on the prior distribution function.
Optionally, the function construction unit is further configured to construct the objective function by taking a target expectation as the first function term and the opposite number of a target relative entropy as the second function term, where the target expectation is the expectation of the joint probability distribution between the sample text and the sample feature vector, the target relative entropy is the relative entropy between a posterior distribution function and the prior distribution function, and the posterior distribution function is the variational distribution of the joint probability distribution between the sample feature vector and the sample text.
Fig. 9 is a block diagram of an information retrieval apparatus according to an exemplary embodiment of the present application, the apparatus including:
the acquisition module is used for responding to the text retrieval operation and acquiring retrieval information based on the text retrieval operation;
The second input module is used for inputting the search information into a text coding model to obtain a search information feature vector corresponding to the search information, the text coding model is a model which is obtained by training by taking an objective function as a training target and based on a text relation network, the text relation network is an undirected graph which takes texts as nodes and takes the relation between the texts as edges, the objective function comprises a first function item and a second function item, the first function item is used for representing the representation quality of the sample feature vector on semantic information in a sample text, and the second function item is used for representing the simulation quality of the correlation between the sample feature vectors on the text relation network;
A second determining module, configured to determine a target text from a text library based on the feature vector of the search information, where the target text is a text having a correlation with the search information;
and the display module is used for displaying the target text through a search result display interface.
Optionally, the second determining module includes:
The obtaining unit is used for obtaining candidate text feature vectors of each candidate text in the text library, wherein the candidate text feature vectors are obtained by inputting the candidate text into the text coding model;
a calculation unit for calculating the similarity between the search information feature vector and each candidate text feature vector;
and the determining unit is used for determining the candidate text with the similarity larger than a similarity threshold value as the target text.
In the embodiment of the application, the objective function constrains how well the sample feature vectors represent the sample text content, and also constrains the text coding model so that the correlations between the sample feature vectors conform to the text relationship network. Training the text coding model with this objective function allows modeling based on the network relations of the sample texts while, even when the edges of the text relationship network are sparse and noisy, letting the model obtain more accurate vector representations by capturing the semantic information of the texts, thereby improving the text coding performance of the text coding model. In the model application stage, the text coding model is used to encode the retrieval information, and the target text is queried and displayed based on the obtained retrieval information feature vector, improving information retrieval efficiency and accuracy.
It should be noted that: the apparatus provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the training device of the text coding model provided in the above embodiment and the training method embodiment of the text coding model belong to the same concept, and the information retrieval device and the information retrieval method embodiment belong to the same concept, and detailed implementation processes of the training device and the training method embodiment of the text coding model are detailed in the method embodiment, and are not repeated here.
Referring to fig. 10, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown. The computer apparatus 1000 includes a central processing unit (CPU) 1001, a system memory 1004 including a random access memory (RAM) 1002 and a read-only memory (ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The computer device 1000 also includes a basic input/output (I/O) system 1006, which helps to transfer information between the various components within the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1015.
The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse, keyboard, etc., for a user to input information. Wherein the display 1008 and the input device 1009 are connected to the central processing unit 1001 via an input output controller 1010 connected to a system bus 1005. The basic input/output system 1006 may also include an input/output controller 1010 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 1010 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 may include a computer-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Generally, the computer-readable medium may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state memory technology, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the above. The system memory 1004 and the mass storage device 1007 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1001, the one or more programs containing instructions for implementing the methods described above, the central processing unit 1001 executing the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the application, the computer device 1000 may also operate by being connected to a remote computer on a network, such as the Internet. I.e., the computer device 1000 may be connected to the network 1012 through a network interface unit 1011 connected to the system bus 1005, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1011.
The memory also includes one or more programs stored in the memory, the one or more programs including steps for performing the methods provided by the embodiments of the present application, as performed by the computer device.
Embodiments of the present application also provide a computer readable storage medium storing at least one instruction that is loaded and executed by a processor to implement the training method of a text encoding model or the information retrieval method described in the above embodiments.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them to cause the computer device to perform the training method of the text encoding model, or the information retrieval method, provided in the various alternative implementations of the above aspects.
Those skilled in the art will appreciate that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on, or transmitted as one or more instructions or code over, a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application; the scope of the application is defined by the appended claims.

Claims (8)

1. A method of training a text encoding model, the method comprising:
extracting semantic information from a sample text to obtain text attributes of the sample text;
connecting the sample texts with the same text attribute to generate a text relationship network, wherein the text relationship network is an undirected graph taking the sample texts as nodes and taking a connecting line between adjacent nodes as an edge, and the adjacent nodes have the same text attribute;
generating an adjacency matrix based on a network structure of the text relationship network, wherein the adjacency matrix is a two-dimensional array for representing the relationship between nodes in the text relationship network;
inputting the adjacency matrix into a text coding model to obtain a prior distribution function, wherein the text coding model is a generative model, the prior distribution function is a Gaussian distribution function whose variance is a target covariance matrix, and the target covariance matrix is the inverse of a precision matrix corresponding to the adjacency matrix;
constructing an objective function based on the prior distribution function, wherein the objective function comprises a first function term and a second function term, the first function term characterizing how well a sample feature vector represents the semantic information in the sample text, and the second function term characterizing how well the correlations among the sample feature vectors model the text relationship network;
inputting the sample texts in the text relationship network into the text coding model to obtain sample feature vectors corresponding to the sample texts;
determining a model loss based on the sample feature vector and the objective function;
and iteratively training the text coding model based on the model loss.
2. The method of claim 1, wherein the constructing an objective function based on the prior distribution function comprises:
constructing the objective function by taking a target expectation as the first function term and taking the negative of a target relative entropy as the second function term, wherein the target expectation is the expectation of the joint probability distribution of the sample text and the sample feature vector, the target relative entropy is the relative entropy between a posterior distribution function and the prior distribution function, and the posterior distribution function is a variational distribution of the joint probability distribution between the sample feature vector and the sample text.
3. An information retrieval method, the method comprising:
in response to a text retrieval operation, acquiring retrieval information based on the text retrieval operation;
inputting the retrieval information into a text coding model to obtain a retrieval information feature vector corresponding to the retrieval information, wherein the text coding model is trained by the training method of the text coding model according to claim 1 or 2;
determining a target text from a text library based on the retrieval information feature vector, wherein the target text is a text correlated with the retrieval information;
and displaying the target text through a search result display interface.
4. The method of claim 3, wherein the determining a target text from a text library based on the retrieval information feature vector comprises:
obtaining candidate text feature vectors of each candidate text in the text library, wherein the candidate text feature vectors are obtained by inputting the candidate texts into the text coding model;
calculating the similarity between the retrieval information feature vector and each candidate text feature vector;
and determining a candidate text with a similarity greater than a similarity threshold as the target text.
5. A training device for a text encoding model, the device comprising:
the information extraction module is used for extracting semantic information from a sample text to obtain a text attribute of the sample text;
the first generation module is used for connecting the sample texts with the same text attribute to generate a text relationship network, wherein the text relationship network is an undirected graph taking the sample texts as nodes and taking a connecting line between adjacent nodes as an edge, and the adjacent nodes have the same text attribute;
the second generation module is used for generating an adjacency matrix based on a network structure of the text relationship network, wherein the adjacency matrix is a two-dimensional array used for representing the relationship between nodes in the text relationship network;
the second input module is used for inputting the adjacency matrix into a text coding model to obtain a prior distribution function, wherein the text coding model is a generative model, the prior distribution function is a Gaussian distribution function whose variance is a target covariance matrix, and the target covariance matrix is the inverse of a precision matrix corresponding to the adjacency matrix; and for constructing an objective function based on the prior distribution function, wherein the objective function comprises a first function term and a second function term, the first function term characterizing how well a sample feature vector represents the semantic information in the sample text, and the second function term characterizing how well the correlations among the sample feature vectors model the text relationship network;
the first input module is used for inputting the sample texts in the text relationship network into the text coding model to obtain sample feature vectors corresponding to the sample texts;
the first determining module is used for determining a model loss based on the sample feature vector and the objective function;
and the training module is used for iteratively training the text coding model based on the model loss.
6. An information retrieval apparatus, the apparatus comprising:
the acquisition module is used for responding to a text retrieval operation and acquiring retrieval information based on the text retrieval operation;
the second input module is used for inputting the retrieval information into a text coding model to obtain a retrieval information feature vector corresponding to the retrieval information, wherein the text coding model is trained by the training device of the text coding model according to claim 5;
the second determining module is used for determining a target text from a text library based on the retrieval information feature vector, wherein the target text is a text correlated with the retrieval information;
and the display module is used for displaying the target text through a search result display interface.
7. A computer device, comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by the processor to implement the training method of the text encoding model of claim 1 or 2, or the information retrieval method of claim 3 or 4.
8. A computer readable storage medium, characterized in that at least one computer program is stored in the computer readable storage medium, which computer program is loaded and executed by a processor to implement the training method of a text encoding model according to claim 1 or 2, or the information retrieval method according to claim 3 or 4.
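For readers who want to see the training-side construction of claim 1 concretely, the following sketch builds the text relationship network from shared text attributes, reads off the adjacency matrix, and forms the covariance of the Gaussian prior as the inverse of a precision matrix derived from that adjacency matrix. It is illustrative only: the claim does not fix how the precision matrix is obtained from the adjacency matrix, so the regularized graph Laplacian D - A + eps*I used here is an assumed choice, and the function names (`build_adjacency`, `prior_covariance`) are hypothetical.

```python
# Sketch of the network construction and graph prior of claim 1 (assumptions noted above).
import numpy as np

def build_adjacency(sample_attrs):
    """sample_attrs: one set of text attributes per sample text (node).
    Nodes sharing at least one text attribute are joined by an edge,
    giving the undirected text relationship network."""
    n = len(sample_attrs)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if sample_attrs[i] & sample_attrs[j]:  # shared attribute -> edge
                adj[i, j] = adj[j, i] = 1.0
    return adj

def prior_covariance(adj, eps=1e-2):
    """Target covariance matrix = inverse of a precision matrix derived from
    the adjacency matrix; the precision matrix is assumed here to be the
    regularized graph Laplacian D - A + eps*I (positive definite, so invertible)."""
    degree = np.diag(adj.sum(axis=1))
    precision = degree - adj + eps * np.eye(len(adj))
    return np.linalg.inv(precision)

# Toy usage: three sample texts; the first two share the attribute "diabetes".
attrs = [{"diabetes", "diet"}, {"diabetes"}, {"insomnia"}]
cov = prior_covariance(build_adjacency(attrs))  # variance of the Gaussian prior
```

Under this assumed choice, sample texts that are connected in the network receive positively correlated coordinates under the prior, which is what allows the second function term to reward feature vectors whose correlations mimic the network structure.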
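The objective of claim 2 matches the standard variational lower bound. With assumed notation — X for the sample texts, Z for the sample feature vectors, q(Z|X) for the posterior (variational) distribution produced by the encoder, and p(Z) for the graph-structured Gaussian prior of claim 1 — it can be written as:

```latex
\mathcal{L}
  = \underbrace{\mathbb{E}_{q(Z \mid X)}\bigl[\log p(X \mid Z)\bigr]}_{\text{first function term}}
  \; - \;
  \underbrace{D_{\mathrm{KL}}\bigl(q(Z \mid X) \,\big\|\, p(Z)\bigr)}_{\text{target relative entropy}}
```

The first term is the target expectation, measuring how well the sample feature vectors explain the sample texts; subtracting the KL term adds the negative of the target relative entropy as the second function term. The model loss of claim 1 can then be taken as the negative of this bound, so that minimizing the loss maximizes the bound.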
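On the retrieval side, claim 4 reduces to scoring one query vector against precomputed candidate vectors and keeping those above a threshold. The sketch below uses cosine similarity and a threshold of 0.8 — both assumed choices, since the claims require only some similarity measure and a similarity threshold — and the function name `retrieve` is hypothetical.

```python
# Sketch of the candidate scoring and thresholding of claim 4 (assumptions noted above).
import numpy as np

def retrieve(query_vec, candidate_vecs, threshold=0.8):
    """query_vec: (d,) retrieval information feature vector.
    candidate_vecs: (n, d) candidate text feature vectors, precomputed by
    feeding each candidate text through the same text coding model.
    Returns indices of candidates whose similarity exceeds the threshold."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity, one score per candidate
    return np.flatnonzero(sims > threshold)

# Toy usage: random vectors stand in for model outputs; the query is a lightly
# perturbed copy of candidate 2, so only index 2 should clear the threshold.
rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 16))
query = docs[2] + 0.05 * rng.normal(size=16)
print(retrieve(query, docs))  # expected: [2]
```

Precomputing the candidate vectors once and reusing them for every query makes retrieval a single matrix-vector product per query.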
CN202110572323.8A 2021-05-25 2021-05-25 Training method of text coding model, information retrieval method and equipment Active CN113761124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110572323.8A CN113761124B (en) 2021-05-25 2021-05-25 Training method of text coding model, information retrieval method and equipment


Publications (2)

Publication Number Publication Date
CN113761124A CN113761124A (en) 2021-12-07
CN113761124B (en) 2024-04-26

Family

ID=78787179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110572323.8A Active CN113761124B (en) 2021-05-25 2021-05-25 Training method of text coding model, information retrieval method and equipment

Country Status (1)

Country Link
CN (1) CN113761124B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860868B * 2022-03-08 2024-10-11 Ocean University of China A semantic similarity vector sparse coding indexing and retrieval method
CN114969289A 2022-05-27 2022-08-30 Ping An Technology (Shenzhen) Co., Ltd. Classical Chinese text retrieval method, device, computer equipment and storage medium
CN115203476A 2022-07-13 2022-10-18 Tencent Technology (Shenzhen) Co., Ltd. Information retrieval method, model training method, device, equipment and storage medium
CN116341502B * 2023-04-13 2024-09-17 Liu Fuxiong Product data detection method and system based on digital factory
CN117807114B * 2024-03-01 2024-05-07 Shenzhen Kuaijin Data Technology Service Co., Ltd. Logistics information intelligent retrieval method, system, equipment and storage medium


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108109153A * 2018-01-12 2018-06-01 Xidian University SAR image segmentation method based on SAR-KAZE feature extractions
CN111737406A (en) * 2020-07-28 2020-10-02 腾讯科技(深圳)有限公司 Text retrieval method, device and equipment and training method of text retrieval model
CN111753060A (en) * 2020-07-29 2020-10-09 腾讯科技(深圳)有限公司 Information retrieval method, device, equipment and computer readable storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Effective Representing of Information Network by Variational Autoencoder; Li, H et al.; Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence; 2017-08-19; 2103-2109 *
Network-Specific Variational Auto-Encoder for Embedding in Attribute Networks; Jin, D et al.; Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence; 2019-08-10; 2663-2669 *
Gene Differential Network Analysis Based on the Bagging Dtrace Model; Zhang Xiang; China Master's Theses Full-text Database (Social Sciences II); 2019-01-15 (No. 01); H123-799 *
Structure Learning of Hub Networks Based on Graphical Models; Zhang Chongyang; Guo Xiao; Zhang Hai; Journal of Northwestern Polytechnical University; 2019-12-15 (No. 06); 1320-1325 *


Similar Documents

Publication Publication Date Title
CN113761124B (en) Training method of text coding model, information retrieval method and equipment
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN109033068B (en) Method and device for reading and understanding based on attention mechanism and electronic equipment
CN108694225B (en) Image searching method, feature vector generating method and device and electronic equipment
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN110457708B (en) Vocabulary mining method and device based on artificial intelligence, server and storage medium
US11461613B2 (en) Method and apparatus for multi-document question answering
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN111539197A (en) Text matching method and device, computer system and readable storage medium
CN114547267B (en) Method, device, computing device and storage medium for generating intelligent question-answering model
CN113792594B (en) Method and device for locating language fragments in video based on contrast learning
CN114329029B (en) Object retrieval method, device, equipment and computer storage medium
CN117494815B (en) File-oriented credible large language model training and reasoning method and device
CN118035409A (en) Question answering method and device, storage medium and computing equipment
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN115525740A (en) Method and device for generating dialogue response sentence, electronic equipment and storage medium
CN117018632A (en) Game platform intelligent management method, system and storage medium
CN116186220A (en) Information retrieval method, question and answer processing method, information retrieval device and system
CN113326383B (en) Short text entity linking method, device, computing equipment and storage medium
CN112861474B (en) Information labeling method, device, equipment and computer readable storage medium
CN112416754B (en) Model evaluation method, terminal, system and storage medium
CN113157892B (en) User intention processing method, device, computer equipment and storage medium
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
CN117009599A (en) Data retrieval method and device, processor and electronic equipment
CN113486142B (en) A method and computer device for predicting word semantics based on semantic origin

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant