
CN117275589A - HIV antibody affinity prediction methods, systems, equipment and media - Google Patents


Info

Publication number
CN117275589A
Authority
CN
China
Prior art keywords
training
data
affinity
neural network
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311088052.4A
Other languages
Chinese (zh)
Inventor
李毅
尹国轩
程宏晗
羊海潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dali University
Original Assignee
Dali University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dali University filed Critical Dali University
Priority to CN202311088052.4A priority Critical patent/CN117275589A/en
Publication of CN117275589A publication Critical patent/CN117275589A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Biotechnology (AREA)
  • General Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an HIV antibody affinity prediction method, system, computer device, and storage medium. The method comprises the following steps: retrieving from a database the affinity data of HIV viral proteins and antibodies and the HIV viral protein sequence data corresponding to the antibodies, generating a data set using IC50 as the standard for measuring affinity, and preprocessing the data to obtain an input data set; constructing a baseline machine learning model and dividing the input data set into a first training set and a first test set; adjusting the baseline model's parameters to obtain a first training result; constructing a recurrent neural network model, dividing the input data set into a second training set, a validation set, and a second test set, and training the recurrent neural network model to output second affinity prediction data and obtain a second training result; and determining the affinity prediction result. By constructing a neutralizing antibody-HIV viral protein-IC50 input data set and comparing the quantitative affinity prediction performance of the two models, the prediction performance for HIV antibody affinity is improved.

Description

HIV antibody affinity prediction method, system, device and medium
Technical Field
The invention relates to the technical field of antibody affinity detection, and in particular to an HIV antibody affinity prediction method, system, computer device, and storage medium.
Background
An HIV antibody is an antibody that acts against the HIV virus. The envelope protein of HIV is the principal molecular basis of the interaction between the virus and broadly neutralizing antibodies (bNAbs), and its antigenic diversity severely hampers the development of effective antiviral therapeutics and vaccines. Broadly neutralizing antibodies targeting the envelope protein gp160 have been shown to be a promising path toward new drugs or vaccines through the immune responses they induce, and the envelope proteins of HIV can be studied with artificial intelligence methods. Some researchers classify HIV strains into two categories, sensitive and resistant, according to their affinity for antibodies, and conduct non-quantitative studies in this binary-classification form.
However, existing experimental methods for determining HIV antibody affinity are very time-consuming and laborious. Some computational methods have been used to predict HIV antibody affinity, but they suffer from insufficient antibody coverage, low accuracy, and an inability to reveal molecular mechanisms.
Disclosure of Invention
In order to solve the above problems, an HIV antibody affinity prediction method, system, computer device, and storage medium are provided that can improve the performance of HIV antibody affinity prediction.
A method of HIV antibody affinity prediction, the method comprising:
retrieving from a database the affinity data of HIV viral proteins and antibodies and the HIV viral protein sequence data corresponding to the antibodies, and generating a data set using IC50 as the standard for measuring affinity;
performing data preprocessing on the data set to obtain an input data set;
constructing a baseline machine learning model, and dividing the input data set into a first training set and a first test set; training the baseline machine learning model with the first training set, adjusting its parameters, outputting first affinity prediction data, and obtaining a first training result from the first test set;
constructing a recurrent neural network model, dividing the input data set into a second training set, a validation set, and a second test set, training the recurrent neural network model with the second training set, outputting second affinity prediction data, and obtaining a second training result from the second test set;
and determining an affinity prediction result from the first training result and the second training result.
In one embodiment, retrieving from the database the affinity data of HIV viral proteins and antibodies and the antibody-corresponding HIV viral protein sequence data, and generating a data set using IC50 as the standard for measuring affinity, comprises:
searching the public CATNAP database for affinity data of HIV viral proteins and experimentally determined neutralizing antibodies;
taking IC50 as the quantitative measure of experimentally determined antibody affinity;
and generating the data set by organizing the records in the form neutralizing antibody-HIV viral protein-IC50.
In one embodiment, the public CATNAP database stores HIV viral protein sequences and antibody protein sequences. Searching the public CATNAP database for experimentally determined affinity data of HIV viral proteins and neutralizing antibodies comprises:
collecting a search instruction through a search box;
and, according to the search instruction, deleting HIV virus records with no recorded sequence in the public CATNAP database and screening out the experimentally determined affinity data of HIV viral proteins and neutralizing antibodies.
In one embodiment, preprocessing the data set to obtain an input data set comprises:
encoding the data set, i.e. encoding the features of the data into numerical form;
and normalizing the numerical data features to obtain the input data set;
the data features are represented by amino acid numeric encoding, one-hot encoding, and amino acid physicochemical property encoding.
In one embodiment, the baseline machine learning model includes a decision tree and a random forest. Dividing the input data set into a first training set and a first test set, training the baseline machine learning model with the first training set, and adjusting its parameters comprises:
dividing the input data set into a first training set and a first test set in an 8:2 ratio;
training the baseline machine learning model on the first training set, and testing the training result on the first test set to obtain a test result;
and adjusting the baseline machine learning model parameters according to the test result.
In one embodiment, the recurrent neural network model comprises a bidirectional gated recurrent unit (Bi-GRU) and a bidirectional long short-term memory network (Bi-LSTM). Dividing the input data set into a second training set, a validation set, and a second test set, and training the recurrent neural network model with the second training set, comprises:
dividing the input data set into a second training set, a validation set, and a second test set in an 8:1:1 ratio;
feeding the second training set into the recurrent neural network model, setting the recurrent unit in the model to GRU or LSTM, taking the output of the recurrent unit as the input of a fully connected layer, and performing dimension transformation and affinity-value calculation in the fully connected layer to obtain the training parameters of the model;
and outputting second affinity prediction data according to the training parameters, and obtaining a second training result from the validation set and the second test set.
In one embodiment, performing dimension transformation and affinity-value calculation in the fully connected layer comprises:
feeding the input data set through the recurrent neural network model's bidirectional recurrent units and a linear layer to extract data features, outputting an antibody feature vector and a viral protein feature vector;
and combining the antibody feature vector and the viral protein feature vector into a single feature vector, feeding it into the fully connected layer, and performing the dimension transformation and affinity-value calculation through the fully connected layer.
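The final fully connected step described in this embodiment can be sketched in a few lines of numpy. This is a minimal illustration only: the 64-dimensional feature vectors, weight shapes, and function names are hypothetical, standing in for whatever the bidirectional recurrent encoders actually produce.

```python
import numpy as np

def predict_affinity(ab_vec, env_vec, W, b):
    """Concatenate the antibody and viral-protein feature vectors and
    apply one fully connected (linear) layer, collapsing the combined
    feature dimension to a single affinity value."""
    combined = np.concatenate([ab_vec, env_vec])   # dimension transformation
    return float(W @ combined + b)                 # scalar affinity estimate

# Hypothetical 64-dim features from each encoder branch.
rng = np.random.default_rng(0)
ab_vec, env_vec = rng.normal(size=64), rng.normal(size=64)
W, b = rng.normal(size=128), 0.0
y = predict_affinity(ab_vec, env_vec, W, b)
```

In a trained model W and b would be learned jointly with the recurrent layers; here they are random placeholders to show the data flow.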
An HIV antibody affinity prediction system, the system comprising:
a data set generation module for retrieving from the database the affinity data of HIV viral proteins and antibodies and the HIV viral protein sequence data corresponding to the antibodies, and generating a data set using IC50 as the standard for measuring affinity;
a preprocessing module for performing data preprocessing on the data set to obtain an input data set;
a baseline machine learning model training module for constructing a baseline machine learning model, dividing the input data set into a first training set and a first test set, training the baseline machine learning model with the first training set, adjusting its parameters, outputting first affinity prediction data, and obtaining a first training result from the first test set;
a recurrent neural network model training module for constructing a recurrent neural network model, dividing the input data set into a second training set, a validation set, and a second test set, training the recurrent neural network model with the second training set, outputting second affinity prediction data, and obtaining a second training result from the second test set;
and an affinity prediction module for determining an affinity prediction result from the first training result and the second training result.
A computer device comprising a memory storing a computer program and a processor which, when executing the computer program, performs the steps of:
retrieving from a database the affinity data of HIV viral proteins and antibodies and the HIV viral protein sequence data corresponding to the antibodies, and generating a data set using IC50 as the standard for measuring affinity;
performing data preprocessing on the data set to obtain an input data set;
constructing a baseline machine learning model, and dividing the input data set into a first training set and a first test set; training the baseline machine learning model with the first training set, adjusting its parameters, outputting first affinity prediction data, and obtaining a first training result from the first test set;
constructing a recurrent neural network model, dividing the input data set into a second training set, a validation set, and a second test set, training the recurrent neural network model with the second training set, outputting second affinity prediction data, and obtaining a second training result from the second test set;
and determining an affinity prediction result from the first training result and the second training result.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
retrieving from a database the affinity data of HIV viral proteins and antibodies and the HIV viral protein sequence data corresponding to the antibodies, and generating a data set using IC50 as the standard for measuring affinity;
performing data preprocessing on the data set to obtain an input data set;
constructing a baseline machine learning model, and dividing the input data set into a first training set and a first test set; training the baseline machine learning model with the first training set, adjusting its parameters, outputting first affinity prediction data, and obtaining a first training result from the first test set;
constructing a recurrent neural network model, dividing the input data set into a second training set, a validation set, and a second test set, training the recurrent neural network model with the second training set, outputting second affinity prediction data, and obtaining a second training result from the second test set;
and determining an affinity prediction result from the first training result and the second training result.
With the HIV antibody affinity prediction method, system, computer device, and storage medium described above, a neutralizing antibody-HIV viral protein-IC50 input data set is constructed, a baseline machine learning model and a recurrent neural network model are built, trained, and tested, and the quantitative affinity prediction performance of the two models is compared, so that the higher-performing prediction is taken as the final affinity prediction result, improving the prediction performance for HIV antibody affinity.
Drawings
FIG. 1 is a diagram of the environment in which the HIV antibody affinity prediction method of one embodiment is used;
FIG. 2 is a flow chart of a method for predicting affinity of an HIV antibody according to one embodiment;
FIG. 3 is a schematic diagram of a Bi-GRU model structure in one embodiment;
FIG. 4 is a diagram of a viral antibody affinity prediction architecture based on a recurrent neural network model in one embodiment;
FIG. 5 is a block diagram of an HIV antibody affinity prediction system in one embodiment;
FIG. 6 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It is to be understood that the terms "first," "second," and the like may be used herein to describe training sets and test sets, but these sets are not limited by the terms; the terms serve only to distinguish one training set or test set from another. For example, a first training set could be termed a second training set and, similarly, a second training set could be termed a first training set without departing from the scope of the present application. The first and second training sets are both training sets, but they are not the same training set.
The HIV antibody affinity prediction method provided in the embodiments of the present application may be applied to an application environment as shown in FIG. 1, which includes a computer device 110. The computer device 110 may retrieve the affinity data of HIV viral proteins and antibodies and the HIV viral protein sequence data corresponding to the antibodies, and generate a data set using IC50 as the standard for measuring affinity; it may perform data preprocessing on the data set to obtain an input data set; it may construct a baseline machine learning model, divide the input data set into a first training set and a first test set, train the baseline model with the first training set, adjust its parameters, output first affinity prediction data, and obtain a first training result from the first test set; it may construct a recurrent neural network model, divide the input data set into a second training set, a validation set, and a second test set, train the recurrent neural network model with the second training set, output second affinity prediction data, and obtain a second training result from the second test set; and it may determine the affinity prediction result from the first and second training results. The computer device 110 may be, but is not limited to, various personal computers, notebook computers, robots, unmanned aerial vehicles, tablet computers, and the like.
In one embodiment, as shown in fig. 2, a method for predicting affinity of an HIV antibody is provided, comprising the steps of:
step 202, searching the database for the affinity data of HIV virus proteins and antibodies, and the corresponding HIV virus protein sequence data of antibodies, and generating a data set by taking IC50 as a standard for measuring the affinity.
The database may be the CATNAP database or the GenBank sequence database, in which the HIV envelope protein gp160 sequences and antibody protein sequences are recorded in detail. In this embodiment, the affinity data of HIV viral proteins and antibodies in the CATNAP database and the HIV viral protein sequence data corresponding to each antibody can be used, and the half-maximal inhibitory concentration (IC50), determined quantitatively over multiple experiments, can be used as the standard for measuring affinity, thereby preparing the data set.
In one embodiment, the provided HIV antibody affinity prediction method may further comprise a data set preparation process, the specific process comprising: searching the public CATNAP database for affinity data of HIV viral proteins and experimentally determined neutralizing antibodies; taking IC50 as the quantitative measure of experimentally determined antibody affinity; and generating the data set by organizing the records in the form neutralizing antibody-HIV viral protein-IC50.
In one embodiment, the provided HIV antibody affinity prediction method may further comprise a process of selecting data from the database, the specific process comprising: collecting a search instruction through a search box; and, according to the search instruction, deleting HIV virus records with no recorded sequence in the public CATNAP database and screening out the experimentally determined affinity data of HIV viral proteins and neutralizing antibodies.
A search box can be displayed on the computer device's screen and the Antibody and Virus search mode used; when searching, "Exclude viruses having no sequence data" is selected in the search box to remove HIV viruses with no recorded sequence, and the antibody-virus interaction is quantified using the IC50 (in μg/ml).
Step 204, data preprocessing is performed on the data set to obtain an input data set.
In the CATNAP database, among the antibodies with the largest numbers of records, 10-1074, 2G12, 3BNC117, PG9, PGDM1400, PGT121, VRC01, and b12, which are currently in clinical development, can be selected. Based on the data provided by the CATNAP database, a data preprocessing operation is performed and the size of the preprocessed base data set is updated accordingly; as shown in the table below, the sequences of the 8 antibodies are paired with 1156 unique HIV envelope protein sequences, giving a total of 6418 viral protein-antibody interaction pairs that form the input data set.
Table: numbers of antibody protein and HIV envelope protein gp160 sequences after preprocessing (table not reproduced).
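The pairing step described above can be sketched with pandas: join the measured affinity records with the antibody and envelope sequence tables so that each interaction pair carries both sequences. The frames, column names, and example values below are hypothetical stand-ins for the CATNAP-derived tables; only measured pairs survive the join, which is why the pair count is smaller than the full antibody-virus cross product.

```python
import pandas as pd

# Hypothetical frames mirroring the CATNAP-derived tables described above.
affinity = pd.DataFrame({
    "antibody": ["VRC01", "PG9", "VRC01"],
    "virus":    ["BG505", "BG505", "JR-CSF"],
    "ic50":     [0.08, 1.2, 0.33],            # μg/ml, illustrative values
})
ab_seqs  = pd.DataFrame({"antibody": ["VRC01", "PG9"],
                         "ab_seq":   ["EVQLV...", "QRLVE..."]})
env_seqs = pd.DataFrame({"virus":   ["BG505", "JR-CSF"],
                         "env_seq": ["MRVKG...", "MKVKG..."]})

# Join so every interaction pair carries both sequences: the
# neutralizing antibody-HIV viral protein-IC50 form of the data set.
dataset = (affinity.merge(ab_seqs, on="antibody")
                   .merge(env_seqs, on="virus"))
```

Only rows present in the affinity table appear in the result, so the data set size is determined by the measured interactions rather than by all possible pairings.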
After the computer device generates the input data set, the baseline model methods (random forest and decision tree) and the deep learning method can each be used to study the interaction of HIV envelope proteins and antibody proteins, and the regression prediction performance of the several models on the HIV antibody data can be compared.
Specifically, in one embodiment, the provided HIV antibody affinity prediction method may further include a data preprocessing process comprising: encoding the data set, i.e. encoding the features of the data into numerical form; normalizing the numerical data features to obtain an input data set; the data features are represented by amino acid numeric encoding, one-hot encoding, and amino acid physicochemical property encoding.
The feature representation of the data uses three schemes: amino acid numeric encoding, one-hot encoding, and amino acid physicochemical property encoding; the input to the baseline models is the sequence's amino acid numeric encoding. A dictionary mapping the 20 amino acids to numbers is constructed, and each sequence is encoded into a numpy array. One-hot encoding is commonly used for non-numeric sequences: a state feature is encoded with an N-bit register in which only one bit is set at any time (hence it is also called one-bit-effective encoding), and the final data are represented as binary vectors. The standard amino acid abbreviations and their specific physicochemical properties, such as charge, hydrophobicity, polarity, and volume, are also used; the physicochemical properties of amino acids in a protein sequence play a key role in protein folding and affect protein function, and these properties can be represented as numerical values.
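The first two encoding schemes above can be sketched directly in numpy. This is a minimal illustration under the stated 20-amino-acid dictionary; the alphabet ordering and function names are choices made here, not taken from the patent.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"            # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def numeric_encode(seq):
    """Amino acid numeric encoding: map each residue to its dictionary index."""
    return np.array([AA_INDEX[aa] for aa in seq])

def one_hot_encode(seq):
    """One-hot encoding: one 20-bit row per residue with a single bit set."""
    idx = numeric_encode(seq)
    out = np.zeros((len(seq), len(AMINO_ACIDS)))
    out[np.arange(len(seq)), idx] = 1.0
    return out

codes = numeric_encode("ACDG")
onehot = one_hot_encode("ACDG")
```

A physicochemical encoding would follow the same pattern, replacing the index lookup with a per-residue vector of numeric property values (charge, hydrophobicity, polarity, volume).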
Step 206, constructing a reference machine learning model, and dividing an input data set into a first training set and a first test set; and training a reference machine learning model by using the first training set, adjusting parameters of the reference machine learning model, outputting first affinity prediction data, and obtaining a first training result according to the first test set.
Based on the generated neutralizing antibody-HIV viral protein-IC50 data set, the computer device can build two baseline machine learning models, a decision tree and a random forest, study the affinity of HIV viral proteins and antibodies with both, and compare the performance of the two models.
Targeting quantitative prediction of HIV antibody affinity, the computer device establishes baseline regression models for HIV antibody affinity and, for better prediction, searches the parameters of the random forest and decision tree during training using a grid search method.
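The grid search step can be sketched with scikit-learn's GridSearchCV. The feature matrix, target values, and parameter grid below are hypothetical; the patent does not list the exact values searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the encoded sequence features and IC50-derived targets.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(80, 10)), rng.normal(size=80)

# Hypothetical parameter grid; cross-validated R^2 selects the best setting.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    cv=3, scoring="r2",
)
grid.fit(X, y)
best_rf = grid.best_estimator_
```

The same pattern applies to the decision tree baseline, swapping in DecisionTreeRegressor and its parameter grid.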
A decision tree and a random forest are selected as the baseline machine learning models for affinity prediction. The general procedure for the decision tree is to collect the data set required for training, process it so that it meets the decision tree's input requirements, and then recursively construct the tree on the training data until it reaches acceptable predictive accuracy. The decision tree model is chosen for predicting HIV virus-antibody affinity because it is friendly to data set processing and can handle continuous features for regression, which suits protein sequences particularly well.
Random forest (RF) regression is a supervised learning algorithm based on ensemble learning: it combines the predictions of multiple learners to make more accurate predictions than a single model, improving training accuracy.
Model evaluation and selection are crucial steps in machine learning. To give the model good generalization ability, it must be tested and then compared on several evaluation indices, and the model that performs best on the performance measures is finally selected as the optimal model. Four different indices are chosen to measure model performance: the coefficient of determination (R²), the mean absolute error (MAE), the root mean square error (RMSE), and the Pearson correlation coefficient (Pearson).
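The four evaluation indices can be computed with scikit-learn and scipy. The prediction values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from scipy.stats import pearsonr

# Illustrative true vs predicted affinity values (e.g. log-scaled IC50).
y_true = np.array([0.1, 0.4, 0.8, 1.2])
y_pred = np.array([0.2, 0.35, 0.9, 1.0])

r2   = r2_score(y_true, y_pred)                       # coefficient of determination
mae  = mean_absolute_error(y_true, y_pred)            # mean absolute error
rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))  # root mean square error
pear = pearsonr(y_true, y_pred)[0]                    # Pearson correlation coefficient
```

R² close to 1, low MAE and RMSE, and Pearson close to 1 all indicate better regression performance, which is how the two baseline models are compared below.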
In one embodiment, the provided method for predicting affinity of an HIV antibody may further comprise a process of training a reference machine learning model, the specific process comprising: dividing an input data set into a first training set and a first test set according to the proportion of 8:2; training a reference machine learning model by a first training set, and testing training results by a first testing set to obtain testing results; and adjusting the reference machine learning model parameters according to the test result.
The computer device divides the input data set into a training set and a test set in an 8:2 ratio: pandas randomly samples and shuffles the data, and the index is reset to obtain train and test data sets of 5134 and 1284 records, respectively.
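The shuffle-and-split step just described can be sketched with pandas. A 10-row toy frame stands in for the 6418 encoded interaction pairs; the random seed is an arbitrary choice for reproducibility.

```python
import pandas as pd

# Hypothetical frame standing in for the encoded interaction pairs.
df = pd.DataFrame({"x": range(10), "y": range(10)})

# Shuffle (sample the full frame without replacement), reset the index,
# then split 8:2 into train and test sets.
shuffled = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
cut = int(len(shuffled) * 0.8)
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]
```

On the real data set the same 8:2 cut yields the 5134 training and 1284 test records reported above.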
Considering that sequence composition is not uniform across the different protein sequence records, in baseline model training the data are normalized after being encoded into numerical form. Normalization changes the original distribution and adjusts the data values, eliminating the influence of different physical dimensions so that each feature dimension carries equal weight in the objective function, which facilitates comparison and weighting. Normalizing the protein sequence data maps the values into the [0, 1] interval and speeds up model fitting; for a data set X, min-max normalization gives x' = (x - min(X)) / (max(X) - min(X)). After training, the accuracy and validity of the model must be verified on the data. Negative values of R² can occur because the data are not linear. The results of the model tests are shown in the following table:
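The min-max normalization formula above is a one-liner in numpy; the sample values are illustrative only.

```python
import numpy as np

def min_max_normalize(x):
    """Map feature values into [0, 1]: x' = (x - min(X)) / (max(X) - min(X))."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

scaled = min_max_normalize([2.0, 4.0, 6.0, 10.0])
```

In practice the minimum and maximum are computed on the training set and reused for the test set, so the two splits share the same scaling.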
model HIV antibody affinity predictive performance on GP160 sequence
On the most important criterion, the coefficient of determination, the random forest (RF) outperforms the decision tree (DT); its mean absolute error and root mean square error are lower, and its Pearson correlation coefficient is closer to 1. The random forest is therefore superior to the decision tree on every evaluation criterion, because it builds multiple decision trees on subsets of the data and aggregates their predictions, reducing the influence of outliers. A single decision tree trains on all features and samples and overfits easily, whereas a random forest constructs many trees, each using part of the features of part of the data samples; reducing the features and data seen by any single tree reduces the possibility of overfitting.
And step 208, constructing a cyclic neural network model, dividing the input data set into a second training set, a verification set and a second test set, training the cyclic neural network model by using the second training set, outputting second affinity prediction data, and obtaining a second training result according to the second test set.
Based on the recurrent neural network model, mainly comprising a bidirectional gated recurrent unit and a bidirectional long short-term memory network, the computer device can evaluate the model's output, compare its prediction performance with the baseline models, explore the internal relationship of virus-antibody interaction, and introduce an attention mechanism for further exploration.
The recurrent neural network (RNN) is a common deep learning architecture that learns flexibly from sequence data and makes adaptive decisions. Functions built into the hidden layer of an RNN memorize previous inputs and influence the current input and output; information circulates in the memory unit, and the weights are adjusted by gradient descent and backpropagation through time. Classical RNNs suffer from exploding and vanishing gradients and cannot cope with long-range dependencies; long short-term memory networks and gated recurrent units were introduced to extend the RNN, chiefly in how information is memorized, determining which information is forgotten and which is updated and thereby affecting the model's output. This makes recurrent neural networks well suited to learning time series or similar sequence data.
The recurrent neural network model comprises a bidirectional gated recurrent unit and a bidirectional long short-term memory network; that is, two neural networks are adopted. The gated recurrent unit (Gate Recurrent Unit, GRU) is a variant of the RNN in which a unit decides whether to store or delete information according to the importance assigned to it; compared with a common RNN, the GRU is better suited to learning data with long-term dependency characteristics. Letting t denote the time step, the GRU model retains important features through a reset gate (r_t) function and an update gate (u_t) function; the two functions continually learn weight parameters and distribute the importance of information according to these weights. The GRU is composed of multiple identical network modules and aims to solve the short-term memory problem of the RNN model, using the hidden state of the neural network to adjust information, mainly for updating parameters. In the GRU unit, the update gate helps the model decide how much state information from the previous moment is carried into the current state: the larger the value of the update gate, the more state information from the previous moment is transmitted. The reset gate is responsible for the network's short-term memory. Because the GRU has fewer gate functions than the three gates of the LSTM, it has fewer parameters and trains relatively faster, while still addressing the problem of gradients vanishing as training progresses.
The functions implemented in the GRU can be expressed as:

r_t = σ(W_r·x_t + U_r·h_{t-1} + b_r)

u_t = σ(W_u·x_t + U_u·h_{t-1} + b_u)

h̃_t = tanh(W_h·x_t + U_h·(r_t ⊙ h_{t-1}) + b_h)

h_t = (1 − u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t

The model learns the parameters W_r, U_r, b_r, W_u, U_u, b_u, W_h, U_h, b_h. Here r_t denotes the reset gate and σ is the sigmoid function, which, like tanh, serves as an activation function. x_t, the input vector at time stamp t, is fed into the network unit and multiplied by the weight matrix W_r; h_{t-1}, which carries the information of the preceding t−1 units, is multiplied by its weight U_r; their sum plus the bias parameter is passed through the sigmoid activation function, which compresses the result to between 0 and 1. The update gate u_t differs from the reset gate only in its weight matrices and bias. To obtain the hidden state h_t of the GRU, the reset gate first controls how strongly the previous hidden state influences the candidate hidden state h̃_t, which combines the input at time stamp t with the product r_t ⊙ h_{t-1}; the update gate then determines how much of the acquired information is passed on to the next step.
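The gate functions above can be sketched as a minimal scalar GRU step in pure Python; the weights below are arbitrary illustrative values, not trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step with scalar input and hidden state (toy dimensions)."""
    r = sigmoid(p["W_r"] * x_t + p["U_r"] * h_prev + p["b_r"])   # reset gate r_t
    u = sigmoid(p["W_u"] * x_t + p["U_u"] * h_prev + p["b_u"])   # update gate u_t
    h_cand = math.tanh(p["W_h"] * x_t + p["U_h"] * (r * h_prev) + p["b_h"])
    return (1.0 - u) * h_prev + u * h_cand                        # new hidden state h_t

# Arbitrary illustrative parameters (a trained model would learn these).
params = {"W_r": 0.5, "U_r": 0.3, "b_r": 0.0,
          "W_u": 0.4, "U_u": 0.2, "b_u": 0.0,
          "W_h": 0.8, "U_h": 0.6, "b_h": 0.0}

h = 0.0
for x in [0.1, -0.4, 0.9]:    # a toy numerically encoded residue sequence
    h = gru_step(x, h, params)
```

Because the candidate state passes through tanh and the update gate interpolates between old and candidate states, the hidden state stays bounded in (−1, 1).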
Long short-term memory (LSTM) is a special variant of the recurrent neural network (RNN) that overcomes the stability bottleneck encountered by conventional recurrent networks. The RNN shares the same weight parameters at each layer of the network, and these are tuned through backpropagation and gradient descent. The LSTM consists of a cell state and three gate structures. The cell state is the key to the LSTM: it acts as an information transmission path along which relevant information can be carried throughout the processing of the sequence, so that even information from an early time step can reach the cells of much later time steps, overcoming the short-term memory bottleneck of the traditional neural network. The three gate structures in the LSTM decide which information is saved or forgotten during training: the forget gate selects whether information from the previous time stamp is remembered or, being irrelevant, forgotten; the input gate attempts to learn new information from the input to the unit; and the output gate passes the updated information from the current time stamp to the next.
The functions implemented in the LSTM can be expressed as:

f_t = σ(W_f·x_t + U_f·h_{t-1} + b_f)

i_t = σ(W_i·x_t + U_i·h_{t-1} + b_i)

O_t = σ(W_o·x_t + U_o·h_{t-1} + b_o)

C̃_t = tanh(W_c·x_t + U_c·h_{t-1} + b_c)

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t

h_t = O_t ⊙ tanh(C_t)

where O_t is the output gate at time t, f_t the forget gate, C̃_t the updated candidate cell state, C_t the cell state, and i_t the input gate activation vector. These states propagate forward through the network, with the forget gate playing a key role in reducing overfitting.
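A matching scalar sketch of the LSTM step, again with arbitrary illustrative weights rather than trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with scalar input, hidden state and cell state."""
    f = sigmoid(p["W_f"] * x_t + p["U_f"] * h_prev + p["b_f"])   # forget gate f_t
    i = sigmoid(p["W_i"] * x_t + p["U_i"] * h_prev + p["b_i"])   # input gate i_t
    o = sigmoid(p["W_o"] * x_t + p["U_o"] * h_prev + p["b_o"])   # output gate O_t
    c_cand = math.tanh(p["W_c"] * x_t + p["U_c"] * h_prev + p["b_c"])
    c = f * c_prev + i * c_cand      # cell state C_t carries long-range information
    h = o * math.tanh(c)             # hidden state h_t
    return h, c

# Arbitrary illustrative parameters.
params = {k: 0.5 for k in ("W_f", "U_f", "W_i", "U_i", "W_o", "U_o", "W_c", "U_c")}
params.update({"b_f": 0.0, "b_i": 0.0, "b_o": 0.0, "b_c": 0.0})

h, c = 0.0, 0.0
for x in [0.1, -0.4, 0.9]:           # a toy numerically encoded residue sequence
    h, c = lstm_step(x, h, c, params)
```

The cell state c is updated additively (forget gate times old state plus input gate times candidate), which is what lets gradients flow over long distances.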
In this embodiment, the Bi-directional gated recurrent unit (bidirectional gated recurrent unit, Bi-GRU) and the Bi-directional long short-term memory network (bidirectional long short-term memory, Bi-LSTM) comprise a sequence-processing model with a forward recurrent neural network and a backward recurrent neural network. The forward recurrent neural network processes the input sequence in order, the backward recurrent neural network processes it in reverse order, and this back-and-forth learning provides the output layer with the complete context of the input information at every moment.
The Bi-GRU model structure is shown in FIG. 3; the built-in GRU units can equally be replaced by LSTM network units. At a given moment the input is fed simultaneously to two oppositely directed recurrent neural networks, each of which generates its own state and output at that moment; the output of the bidirectional recurrent neural network is a simple concatenation of the outputs of the two unidirectional networks. The function of a protein is related to the arrangement of its sequence, which determines the position of the protein's residues in the structure, and the bidirectional recurrent neural network model is well suited to such correlations in sequence data.
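The forward/backward concatenation described above can be sketched with a toy tanh recurrent unit (a GRU or LSTM cell would slot in the same way); all values here are illustrative:

```python
import math

def rnn_step(x, h, w_x=0.7, w_h=0.5):
    # minimal tanh recurrent unit standing in for a GRU/LSTM cell
    return math.tanh(w_x * x + w_h * h)

def run_direction(seq):
    """Run the recurrent unit over a sequence, returning the per-step outputs."""
    h, outs = 0.0, []
    for x in seq:
        h = rnn_step(x, h)
        outs.append(h)
    return outs

seq = [0.2, -0.5, 0.9, 0.1]             # toy encoded residues
fwd = run_direction(seq)                 # forward pass over the sequence
bwd = run_direction(seq[::-1])[::-1]     # backward pass, realigned to time order
bi_out = [(f, b) for f, b in zip(fwd, bwd)]  # per-timestep output concatenation
```

Each timestep's bidirectional output pairs the forward state (summarizing the sequence up to t) with the backward state (summarizing the sequence after t), which is the "complete context" handed to the output layer.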
The virus-antibody affinity prediction model of the recurrent neural network is shown in FIG. 4: the HIV virus envelope protein sequence and the neutralizing antibody protein sequence are each encoded by one-hot encoding or amino acid physicochemical property encoding, the encoded data are fed into the recurrent neural network, whose units can be set to GRU or LSTM; the output of the recurrent neural network serves as the input of the fully connected layer, where dimensional transformation and affinity value calculation are performed.
The recurrent neural network model performs feature extraction, feature concatenation, and dimensional transformation on the HIV virus envelope protein sequence data and the antibody protein sequences in the recurrent layer, and performs interaction analysis in the fully connected layer. A fully connected network excels at fitting data, and making the network more complex resolves underfitting; the accompanying problem is that overfitting may occur at any stage of model training, caused by over-learning the features of the data. Adding a dropout layer and decaying the learning rate during training can effectively counter overfitting and avoid oscillation during training.
In one embodiment, the method for predicting affinity of an HIV antibody may further comprise a process of training a cyclic neural network model and performing affinity prediction, and the specific process comprises: dividing an input data set into a second training set, a verification set and a second test set according to the proportion of 8:1:1; training a cyclic neural network model by the second training set, setting a cyclic neural network unit in the cyclic neural network model as GRU or LSTM, taking the output of the cyclic neural network unit as the input of a full-connection layer, and carrying out dimension transformation and calculation of affinity value at the full-connection layer to obtain training parameters of the cyclic neural network model; and outputting second affinity prediction data according to the training parameters, and obtaining a second training result according to the verification set and the second test set.
The computer device may divide the input data set into a second training set, a validation set, and a second test set in a ratio of 8:1:1; pandas random sampling shuffles the data order and the index is reset, yielding three data sets, train, valid and test, containing 5134, 642 and 642 entries respectively. The mean absolute error (MAE), root mean square error (RMSE), Pearson correlation coefficient, and coefficient of determination R2 are used as performance evaluation indexes for the deep learning models. Extending the data set with one-hot encoding and applying the recurrent neural network improve predictive performance: the mean absolute error, the goodness of fit, and the performance of the tested model on the independent data set all improve over the reference models.
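The 8:1:1 shuffle-and-split can be reproduced with a small index-based sketch (pure Python standing in for the pandas sample/reset_index calls; a total of 6418 rows is assumed from the reported 5134/642/642 split):

```python
import random

def split_811(n_total, seed=42):
    """Shuffle row indices and split them 8:1:1 into train/valid/test."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)           # mimics the pandas random shuffle
    n_train = int(n_total * 0.8)
    n_valid = (n_total - n_train) // 2
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

train, valid, test = split_811(6418)
print(len(train), len(valid), len(test))       # → 5134 642 642
```

Fixing the shuffle seed makes the split reproducible across training runs, which matters when comparing models on the same validation and test partitions.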
After each training, the training effect of the model on the verification set is compared, and the comparison result is shown in the following table:
model performance comparison in HIV virus envelope protein gp160 sequence verification set
The table records the average performance of each index of the bidirectional recurrent neural network during training, obtained by averaging the results of 100 training rounds. Compared with the decision tree and random forest, the Bi-GRU shows the best performance on R2, MAE, RMSE and Pearson, with the Bi-LSTM slightly inferior to the Bi-GRU. The mean absolute error of the Bi-GRU is 9.3, its root mean square error 16.44, its Pearson coefficient 0.72, and its coefficient of determination R2 0.49; the Bi-LSTM mean absolute error is 9.02, its root mean square error 16.95, its Pearson coefficient 0.67, and its coefficient of determination 0.428. In fitting effect, the Bi-GRU model performs best compared with the decision tree, random forest and Bi-LSTM.
Different parameter settings affect the results, for example, the Bi-GRU model, as shown in the following table:
corresponding results of different Batch_Size settings in Bi-GRU model
Comparing the effect of Batch_Size on the result with other parameters unchanged: increasing the size of a small batch theoretically favors convergence stability, but when the batch size is particularly large, the performance of the model drops sharply.
From the various attempts in the experiment, the optimal parameter setting for the bidirectional recurrent neural network model on the HIV antibody dataset can be obtained: with Batch_Size set to 32, the Bi-GRU model yields a coefficient of determination of 0.49, an absolute error of 9.3, and a Pearson correlation coefficient of 0.72 on the dataset.
The computer device may compare the predicted results of the Bi-GRU model and the Bi-LSTM on the test set as shown in the following table:
comparison of the predictive Properties of two RNNs on the gp160 sequence of the envelope protein of the HIV virus
Wherein the test set is randomly split from the total data set, accounting for 10% of it; on a protein interaction sequence data set consisting of eight neutralizing antibodies and their target virus protein sequences, the Bi-GRU model is superior to the Bi-LSTM model on every metric of the test set.
The Bi-GRU model performs better than the Bi-LSTM model on the HIV antibody validation set and the independent test set. A significant drawback of the recurrent neural network is that when processing long sequences, the information inside the network becomes increasingly complex, eventually exceeding the memory capacity of the network and rendering the final output chaotic and useless. The GRU network unit has fewer gate functions than the LSTM unit, which reduces the parameter count, makes parameter updates and information propagation easier, and lightens the model's training burden.
Step 210, determining an affinity prediction result according to the first training result and the second training result.
In one embodiment, the provided method for predicting affinity of an HIV antibody may further comprise a process for calculating an affinity value, the specific process comprising: inputting an input data set into a bidirectional cyclic neural unit and a linear layer through a cyclic neural network model to extract data characteristics, and outputting an antibody characteristic vector and a virus protein characteristic vector; the antibody characteristic vector and the virus protein characteristic vector are combined into a characteristic vector, the characteristic vector is input into a full-connection layer, and dimension transformation and affinity value calculation are carried out through the full-connection layer.
Taking the HIV virus envelope protein gp160 sequence and antibody sequences acting on the gp120 protein fragment within gp160 as research objects, an HIV antibody affinity prediction model based on a recurrent neural network is provided. The model feeds the antibody and virus protein sequence data, after one-hot encoding and amino acid attribute encoding, into a bidirectional recurrent unit and a linear layer to extract the data features of the sequences; finally the output antibody feature vector and virus protein feature vector are concatenated into one feature vector and input into a fully connected network to predict the protein interaction.
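The concatenate-then-predict step can be sketched as follows; the feature vectors and weights are hypothetical placeholders for what the recurrent layers and the trained fully connected layer would actually produce:

```python
# Hypothetical feature vectors emitted by the two recurrent branches.
ab_feat = [0.2, -0.1, 0.7]     # antibody feature vector
env_feat = [0.5, 0.3]          # viral envelope protein feature vector

combined = ab_feat + env_feat  # feature concatenation before the fully connected layer

# Hypothetical fully connected layer: one linear unit producing the affinity value.
weights = [0.1, 0.4, -0.2, 0.3, 0.25]
bias = 0.05
affinity = sum(w * x for w, x in zip(weights, combined)) + bias
print(round(affinity, 3))      # → 0.115
```

A real fully connected layer would apply several such linear units with nonlinearities between them, but the dimensional transformation (5 combined features down to 1 affinity value) works the same way.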
According to the characteristics of the sequence data set, a random forest and a decision tree as reference model methods, a bidirectional recurrent neural network model, and recurrent neural networks fused with an attention mechanism and a multi-head attention mechanism, all suited to learning sequence data, are built. In addition to constructing a reference data set and an independent test set for the HIV virus gp160 envelope protein, the gp120 sequence of the viral envelope protein of the corresponding antibodies was taken as an independent test set. The experimental results show that the Bi-GRU and Bi-LSTM models perform best on the reference data set for the HIV antibody affinity prediction task, and the memory cells in the neural network show a strong ability to handle long-distance dependencies.
It should be understood that, although the steps in the above-described flowcharts are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, and the order of execution of the sub-steps or stages is not necessarily sequential, but may be performed alternately or alternately with at least a part of the sub-steps or stages of other steps or other steps.
In one embodiment, as shown in fig. 5, there is provided an HIV antibody affinity prediction system comprising: a dataset generation module 510, a preprocessing module 520, a reference machine learning model training module 530, a recurrent neural network model training module 540, and an affinity prediction module 550, wherein:
a data set generating module 510 for searching the database for the affinity data of the HIV viral proteins and the antibodies, the sequence data of the HIV viral proteins corresponding to the antibodies, and generating a data set by using the IC50 as a standard for measuring the affinity;
A preprocessing module 520, configured to perform data preprocessing on the data set to obtain an input data set;
a reference machine learning model training module 530 for constructing a reference machine learning model, dividing the input data set into a first training set and a first test set; training a reference machine learning model by using a first training set, adjusting parameters of the reference machine learning model, outputting first affinity prediction data, and obtaining a first training result according to a first test set;
the cyclic neural network model training module 540 is configured to construct a cyclic neural network model, divide the input data set into a second training set, a verification set, and a second test set, train the cyclic neural network model using the second training set, output second affinity prediction data, and obtain a second training result according to the second test set;
affinity prediction module 550 is configured to determine an affinity prediction result according to the first training result and the second training result.
In one embodiment, the data set generation module 510 is further configured to retrieve neutralizing antibody affinity data for HIV viral proteins from a public CATNAP database; IC50 as quantitative experimentally determined antibody affinity data; the data set was generated by sorting in the form of neutralizing antibody-HIV viral protein-IC 50.
In one embodiment, the data set generation module 510 is further configured to collect the search instruction through a search box; according to the search instruction, deleting the HIV virus data of unrecorded sequences in the public CATNAP database, and screening the affinity data of the HIV virus protein and the neutralizing antibody determined by experiments.
In one embodiment, the preprocessing module 520 is further configured to encode the data set, encoding the characteristics of the data in the data set into numerical form, and to normalize the numerical data features to obtain the input data set; the data features are amino acid numerical encoding, one-hot encoding and amino acid physicochemical property encoding.
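The one-hot branch of the encoding step can be sketched as follows, covering the 20 standard amino acids (the real pipeline also supports numeric and physicochemical encodings):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues

def one_hot(seq):
    """Encode a protein sequence as a list of 20-dimensional one-hot vectors."""
    vectors = []
    for aa in seq:
        v = [0] * len(AMINO_ACIDS)
        v[AMINO_ACIDS.index(aa)] = 1   # raises ValueError for non-standard residues
        vectors.append(v)
    return vectors

encoded = one_hot("MKV")               # toy fragment of a protein sequence
```

Each residue becomes a sparse 20-dimensional vector with exactly one nonzero entry, so a sequence of length L yields an L×20 input for the recurrent layer.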
In one embodiment, the reference machine learning model includes a decision tree and a random forest; the reference machine learning model training module 530 is further configured to divide the input data set into a first training set and a first test set according to a ratio of 8:2; training a reference machine learning model by a first training set, and testing training results by a first testing set to obtain testing results; and adjusting the reference machine learning model parameters according to the test result.
In one embodiment, the recurrent neural network model comprises a two-way gating neural unit and a two-way long and short term memory network; the recurrent neural network model training module 540 is further configured to divide the input data set into a second training set, a verification set, and a second test set according to a ratio of 8:1:1; training a cyclic neural network model by the second training set, setting a cyclic neural network unit in the cyclic neural network model as GRU or LSTM, taking the output of the cyclic neural network unit as the input of a full-connection layer, and carrying out dimension transformation and calculation of affinity value at the full-connection layer to obtain training parameters of the cyclic neural network model; and outputting second affinity prediction data according to the training parameters, and obtaining a second training result according to the verification set and the second test set.
In one embodiment, affinity prediction module 550 is further configured to input the input data set into the bi-directional recurrent neural unit, the linear layer, extract data features, output antibody feature vectors and viral protein feature vectors through the recurrent neural network model; the antibody characteristic vector and the virus protein characteristic vector are combined into a characteristic vector, the characteristic vector is input into a full-connection layer, and dimension transformation and affinity value calculation are carried out through the full-connection layer.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method for predicting affinity of an HIV antibody. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 6 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
searching the data base for the affinity data of HIV virus proteins and antibodies, the corresponding HIV virus protein sequence data of antibodies, and generating a data set by taking the IC50 as a standard for measuring the affinity;
performing data preprocessing on the data set to obtain an input data set;
constructing a reference machine learning model, and dividing an input data set into a first training set and a first testing set; training a reference machine learning model by using a first training set, adjusting parameters of the reference machine learning model, outputting first affinity prediction data, and obtaining a first training result according to a first test set;
Constructing a circulating neural network model, dividing an input data set into a second training set, a verification set and a second test set, training the circulating neural network model by using the second training set, outputting second affinity prediction data, and obtaining a second training result according to the second test set;
and determining an affinity prediction result according to the first training result and the second training result.
In one embodiment, the processor when executing the computer program further performs the steps of: searching the public CATNAP database for affinity data of HIV virus proteins and experimentally determined neutralizing antibodies; IC50 as quantitative experimentally determined antibody affinity data; the data set was generated by sorting in the form of neutralizing antibody-HIV viral protein-IC 50.
In one embodiment, the processor when executing the computer program further performs the steps of: collecting a retrieval instruction through a retrieval frame; according to the search instruction, deleting the HIV virus data of unrecorded sequences in the public CATNAP database, and screening the affinity data of the HIV virus protein and the neutralizing antibody determined by experiments.
In one embodiment, the processor when executing the computer program further performs the steps of: encoding the data set, encoding the characteristics of the data in the data set into numerical form; normalizing the numerical data features to obtain an input data set; the data features are amino acid numerical encoding, one-hot encoding and amino acid physicochemical property encoding.
In one embodiment, the reference machine learning model includes a decision tree and a random forest; the processor when executing the computer program also implements the steps of: dividing an input data set into a first training set and a first test set according to the proportion of 8:2; training a reference machine learning model by a first training set, and testing training results by a first testing set to obtain testing results; and adjusting the reference machine learning model parameters according to the test result.
In one embodiment, the recurrent neural network model comprises a two-way gating neural unit and a two-way long and short term memory network; the processor when executing the computer program also implements the steps of: dividing an input data set into a second training set, a verification set and a second test set according to the proportion of 8:1:1; training a cyclic neural network model by the second training set, setting a cyclic neural network unit in the cyclic neural network model as GRU or LSTM, taking the output of the cyclic neural network unit as the input of a full-connection layer, and carrying out dimension transformation and calculation of affinity value at the full-connection layer to obtain training parameters of the cyclic neural network model; and outputting second affinity prediction data according to the training parameters, and obtaining a second training result according to the verification set and the second test set.
In one embodiment, the processor when executing the computer program further performs the steps of: inputting an input data set into a bidirectional cyclic neural unit and a linear layer through a cyclic neural network model to extract data characteristics, and outputting an antibody characteristic vector and a virus protein characteristic vector; the antibody characteristic vector and the virus protein characteristic vector are combined into a characteristic vector, the characteristic vector is input into a full-connection layer, and dimension transformation and affinity value calculation are carried out through the full-connection layer.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
searching the data base for the affinity data of HIV virus proteins and antibodies, the corresponding HIV virus protein sequence data of antibodies, and generating a data set by taking the IC50 as a standard for measuring the affinity;
performing data preprocessing on the data set to obtain an input data set;
constructing a reference machine learning model, and dividing an input data set into a first training set and a first testing set; training a reference machine learning model by using a first training set, adjusting parameters of the reference machine learning model, outputting first affinity prediction data, and obtaining a first training result according to a first test set;
Constructing a circulating neural network model, dividing an input data set into a second training set, a verification set and a second test set, training the circulating neural network model by using the second training set, outputting second affinity prediction data, and obtaining a second training result according to the second test set;
and determining an affinity prediction result according to the first training result and the second training result.
In one embodiment, the computer program when executed by the processor further performs the steps of: searching the public CATNAP database for affinity data of HIV virus proteins and experimentally determined neutralizing antibodies; IC50 as quantitative experimentally determined antibody affinity data; the data set was generated by sorting in the form of neutralizing antibody-HIV viral protein-IC 50.
In one embodiment, the computer program when executed by the processor further performs the steps of: collecting a retrieval instruction through a retrieval frame; according to the search instruction, deleting the HIV virus data of unrecorded sequences in the public CATNAP database, and screening the affinity data of the HIV virus protein and the neutralizing antibody determined by experiments.
In one embodiment, the computer program when executed by the processor further performs the steps of: encoding the data set, encoding the characteristics of the data in the data set into numerical form; normalizing the numerical data features to obtain an input data set; the data features are amino acid numerical encoding, one-hot encoding and amino acid physicochemical property encoding.
In one embodiment, the reference machine learning model includes a decision tree and a random forest; the computer program when executed by the processor also performs the steps of: dividing an input data set into a first training set and a first test set according to the proportion of 8:2; training a reference machine learning model by a first training set, and testing training results by a first testing set to obtain testing results; and adjusting the reference machine learning model parameters according to the test result.
In one embodiment, the recurrent neural network model comprises a two-way gating neural unit and a two-way long and short term memory network; the computer program when executed by the processor also performs the steps of: dividing an input data set into a second training set, a verification set and a second test set according to the proportion of 8:1:1; training a cyclic neural network model by the second training set, setting a cyclic neural network unit in the cyclic neural network model as GRU or LSTM, taking the output of the cyclic neural network unit as the input of a full-connection layer, and carrying out dimension transformation and calculation of affinity value at the full-connection layer to obtain training parameters of the cyclic neural network model; and outputting second affinity prediction data according to the training parameters, and obtaining a second training result according to the verification set and the second test set.
In one embodiment, the computer program when executed by the processor further performs the steps of: inputting an input data set into a bidirectional cyclic neural unit and a linear layer through a cyclic neural network model to extract data characteristics, and outputting an antibody characteristic vector and a virus protein characteristic vector; the antibody characteristic vector and the virus protein characteristic vector are combined into a characteristic vector, the characteristic vector is input into a full-connection layer, and dimension transformation and affinity value calculation are carried out through the full-connection layer.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination of them that involves no contradiction should be considered within the scope of this description.
The above examples merely represent a few embodiments of the present application; their description is relatively specific and detailed, but it is not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the concept of the present application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (10)

1. A method for predicting affinity of an HIV antibody, comprising:
searching a database for affinity data of HIV viral proteins and antibodies and for the HIV viral protein sequence data corresponding to the antibodies, and generating a data set using the IC50 as the criterion for measuring affinity;
performing data preprocessing on the data set to obtain an input data set;
constructing a reference machine learning model, and dividing the input data set into a first training set and a first test set; training the reference machine learning model using the first training set, adjusting the parameters of the reference machine learning model, outputting first affinity prediction data, and obtaining a first training result from the first test set;
constructing a recurrent neural network model, dividing the input data set into a second training set, a validation set, and a second test set, training the recurrent neural network model using the second training set, outputting second affinity prediction data, and obtaining a second training result from the second test set; and
determining an affinity prediction result according to the first training result and the second training result.
2. The method of claim 1, wherein searching the database for affinity data of HIV viral proteins and antibodies and for the HIV viral protein sequence data corresponding to the antibodies, and generating the data set using the IC50 as the criterion for measuring affinity, comprises:
searching the public CATNAP database for experimentally determined neutralizing antibody affinity data of HIV viral proteins;
using the IC50 to quantify the experimentally determined antibody affinity data; and
generating the data set by organizing the data into neutralizing antibody-HIV viral protein-IC50 triples.
3. The method for predicting the affinity of an HIV antibody according to claim 2, wherein the public CATNAP database stores HIV viral protein sequences and antibody protein sequences, and searching the public CATNAP database for the experimentally determined neutralizing antibody affinity data of HIV viral proteins comprises:
collecting a search instruction through a search box; and
deleting, from the public CATNAP database according to the search instruction, HIV virus data whose sequences are not recorded, and screening out the experimentally determined neutralizing antibody affinity data of HIV viral proteins.
4. The method of claim 1, wherein performing data preprocessing on the data set to obtain the input data set comprises:
encoding the data set, converting the features of the data in the data set into numeric form; and
normalizing the numeric data features to obtain the input data set;
wherein the data features comprise amino acid numerical encoding, one-hot encoding, and amino acid physicochemical property encoding.
5. The HIV antibody affinity prediction method according to claim 1, wherein the reference machine learning model comprises a decision tree and a random forest, and dividing the input data set into a first training set and a first test set, training the reference machine learning model using the first training set, and adjusting the parameters of the reference machine learning model comprises:
dividing the input data set into the first training set and the first test set in a ratio of 8:2;
training the reference machine learning model with the first training set, and testing the training result with the first test set to obtain a test result; and
adjusting the parameters of the reference machine learning model according to the test result.
6. The method of claim 1, wherein the recurrent neural network model comprises a bidirectional gated recurrent unit network and a bidirectional long short-term memory network, and dividing the input data set into a second training set, a validation set, and a second test set and training the recurrent neural network model using the second training set comprises:
dividing the input data set into the second training set, the validation set, and the second test set in a ratio of 8:1:1;
training the recurrent neural network model with the second training set, setting the recurrent unit in the recurrent neural network model to a GRU or an LSTM, feeding the output of the recurrent unit into a fully connected layer, and performing dimension transformation and affinity-value calculation in the fully connected layer to obtain the training parameters of the recurrent neural network model; and
outputting the second affinity prediction data according to the training parameters, and obtaining the second training result from the validation set and the second test set.
7. The method according to claim 6, wherein performing dimension transformation and affinity-value calculation in the fully connected layer comprises:
feeding the input data set through the bidirectional recurrent units and a linear layer of the recurrent neural network model to extract data features, and outputting an antibody feature vector and a viral protein feature vector; and
merging the antibody feature vector and the viral protein feature vector into a single feature vector, feeding the feature vector into the fully connected layer, and performing dimension transformation and affinity-value calculation through the fully connected layer.
8. An HIV antibody affinity prediction system, comprising:
a data set generation module for searching a database for affinity data of HIV viral proteins and antibodies and for the HIV viral protein sequence data corresponding to the antibodies, and generating a data set using the IC50 as the criterion for measuring affinity;
a preprocessing module for performing data preprocessing on the data set to obtain an input data set;
a reference machine learning model training module for constructing a reference machine learning model, dividing the input data set into a first training set and a first test set, training the reference machine learning model using the first training set, adjusting the parameters of the reference machine learning model, outputting first affinity prediction data, and obtaining a first training result from the first test set;
a recurrent neural network model training module for constructing a recurrent neural network model, dividing the input data set into a second training set, a validation set, and a second test set, training the recurrent neural network model using the second training set, outputting second affinity prediction data, and obtaining a second training result from the second test set; and
an affinity prediction module for determining an affinity prediction result according to the first training result and the second training result.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.
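To make the encoding scheme of claim 4 concrete, the following sketch (an illustrative assumption, not the patented code) shows amino acid numerical encoding, one-hot encoding, and min-max normalization on a toy sequence fragment:

```python
import numpy as np

# The 20 standard amino acids; this index assignment is an
# illustrative choice, not one specified by the claims.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_numeric(seq):
    """Amino acid numerical encoding: each residue -> an integer index."""
    return np.array([AA_INDEX[aa] for aa in seq], dtype=np.int64)

def encode_one_hot(seq):
    """One-hot encoding: each residue -> a 20-dimensional indicator vector."""
    idx = encode_numeric(seq)
    one_hot = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    one_hot[np.arange(len(seq)), idx] = 1.0
    return one_hot

def min_max_normalize(x):
    """Scale numeric features to [0, 1], as in the normalization step."""
    x = np.asarray(x, dtype=np.float32)
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

seq = "MKV"  # a tiny placeholder fragment, not a real HIV sequence
print(encode_numeric(seq))        # integer codes for M, K, V
print(encode_one_hot(seq).shape)  # (3, 20)
```

Physicochemical property encoding would follow the same pattern, mapping each residue to a vector of property values (e.g. hydrophobicity or charge from a chosen scale) before normalization.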
CN202311088052.4A 2023-08-28 2023-08-28 HIV antibody affinity prediction methods, systems, equipment and media Pending CN117275589A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311088052.4A CN117275589A (en) 2023-08-28 2023-08-28 HIV antibody affinity prediction methods, systems, equipment and media


Publications (1)

Publication Number Publication Date
CN117275589A true CN117275589A (en) 2023-12-22

Family

ID=89209463



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118841080A (en) * 2024-07-09 2024-10-25 Hainan University Prediction method and system for relative abundance of proteins in a protein corona based on machine learning
CN119832993A (en) * 2025-01-20 2025-04-15 Sun Yat-sen University H3N2 influenza virus antigenicity dynamic evolution analysis method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101173923A (en) * 2007-09-27 2008-05-07 广州益善生物技术有限公司 Method for detecting antibody affinity
CN103797029A (en) * 2011-05-17 2014-05-14 The Rockefeller University Human immunodeficiency virus neutralizing antibodies and methods of use
CN115116548A (en) * 2022-05-05 2022-09-27 Tencent Technology (Shenzhen) Co., Ltd. Data processing method, data processing apparatus, computer device, medium, and program product
CN115148277A (en) * 2022-07-08 2022-10-04 Tencent Technology (Shenzhen) Co., Ltd. Affinity prediction method, apparatus, device and storage medium
CN115206415A (en) * 2022-08-08 2022-10-18 The Chinese University of Hong Kong (Shenzhen) Training method and device for antibody-protein binding affinity prediction model
CN116434839A (en) * 2023-03-17 2023-07-14 上海数因信科智能科技有限公司 Sequence-based antigen-antibody affinity prediction method




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination