Detailed Description
Embodiment 1: The multi-modal, multi-task method for predicting drug side-effect frequency scores in this embodiment comprises the following steps:
Step one: obtain the chemical structure semantic features of the drug molecules, the chemical sequence semantic features of the drug molecules, the biomedical text features of the drugs, and the biomedical text features of the side effects;
obtain a first drug-side-effect pair based on the chemical structure semantic features of the drug molecules, the chemical sequence semantic features of the drug molecules, the biomedical text features of the drugs, and the biomedical text features of the side effects;
Step two: similarity feature interactive learning module:
compared with extracting information only from the rich biochemical semantics of drugs and side effects, learning the similarity information among drugs and among side effects additionally captures the deeper relationships between drugs and side effects;
Step 2.1: compute the drug similarity information and the side-effect similarity information using Jaccard similarity and cosine similarity, and map both to the same dimension;
the drug similarity information consists of a drug-disease similarity matrix and a drug-drug similarity matrix;
the side-effect similarity information consists of a similarity matrix between side effects, a word-vector representation of the side effects, and a drug-side-effect similarity matrix;
Step 2.2: obtain a second drug-side-effect pair based on the drug-disease similarity matrix, the drug-drug similarity matrix, the similarity matrix between side effects, the word-vector representation of the side effects, and the drug-side-effect similarity matrix;
Step three: fuse the two modules and predict the frequency:
the drug-side-effect pairs learned in step one and step two are concatenated and fed into a multi-layer perceptron, which predicts whether the drug is associated with the side effect and, if so, the frequency score of the drug-side-effect pair.
Embodiment 2: This embodiment differs from Embodiment 1 in that, in step one, the chemical structure semantic features of the drug molecules, the chemical sequence semantic features of the drug molecules, the biomedical text features of the drugs, and the biomedical text features of the side effects are obtained;
a first drug-side-effect pair is obtained based on these four kinds of features;
multi-modal semantic representation learning module: the chemical structure semantics, the chemical sequence semantics, and the biomedical text semantics of a drug molecule all reflect the biological properties of the drug, so the corresponding representations are learned from these three modalities of the drug;
the specific process is as follows:
Step 1.1: process the drug molecules with a graph attention network (GAT) to obtain the chemical structure semantic features of the drug molecules (a learned molecular graph representation);
Step 1.2: process the drug molecules with a Transformer module to obtain the chemical sequence semantic features of the drug molecules;
Step 1.3: obtain the biomedical text features of the drugs and the biomedical text features of the side effects;
Step 1.4: reduce the chemical structure semantic features extracted in step 1.1, the chemical sequence semantic features extracted in step 1.2, and the biomedical text features of the drugs and of the side effects extracted in step 1.3 to the same dimension through fully connected layers, obtaining the dimension-reduced chemical structure semantic features of the drug molecules, the dimension-reduced chemical sequence semantic features of the drug molecules, the dimension-reduced biomedical text features of the drugs, and the dimension-reduced biomedical text features of the side effects;
Step 1.5: to obtain a fine-grained fusion between drugs and side effects:
compute representation 1 as the element-wise product of the dimension-reduced chemical structure semantic features of the drug molecules and the dimension-reduced biomedical text features of the side effects;
compute representation 2 as the element-wise product of the dimension-reduced chemical sequence semantic features of the drug molecules and the dimension-reduced biomedical text features of the side effects;
compute representation 3 as the element-wise product of the dimension-reduced biomedical text features of the drugs and the dimension-reduced biomedical text features of the side effects;
sum representations 1, 2 and 3 and feed the result into a fully connected layer, whose output passes in turn through an activation function and a batch normalization layer to give the drug-side-effect pair learned by the first module.
Other steps and parameters are the same as in the first embodiment.
Embodiment 3: This embodiment differs from Embodiment 1 or 2 in that, in step 1.1, a graph attention network (GAT) processes the drug molecules to obtain their chemical structure semantic features (a learned molecular graph representation); the specific process is as follows:
collect the SMILES sequence of each drug molecule and convert it into an undirected molecular graph G using the RDKit toolkit;
the undirected molecular graph is G = (V, E),
where V is the set of atoms, V = {C, H, O, …, Sr}, and E is the set of chemical bonds between atoms;
construct the feature matrix of the drug molecule by encoding the chemical properties of each atom as a one-hot vector;
construct the adjacency matrix of the drug molecule from its two-dimensional structure: each atom of the drug is a node; if a bond exists between two atoms, the entries of the adjacency matrix at the rows and columns of the two corresponding nodes are set to 1, and if no bond exists between them, the entries are set to 0 (each drug has a unique structure in the two-dimensional plane, and the representation of each atom (C, H, O, etc.) in the drug is learned by aggregating its surrounding neighbors);
the feature matrix and the adjacency matrix of the drug molecule are input into the GAT, the output features of the GAT are input into a max-pooling layer, and the max-pooling layer outputs the chemical structure semantic features of the drug molecule (the learned molecular graph representation).
Other steps and parameters are the same as in the first or second embodiment.
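The construction of the feature matrix and the adjacency matrix in this embodiment can be illustrated with a minimal sketch. It assumes a hand-written atom and bond list for a hypothetical three-atom molecule instead of an RDKit parse, and a truncated atom vocabulary; the names `one_hot_features` and `adjacency_matrix` are illustrative and not part of the described method.

```python
# Hypothetical hand-written molecule (heavy atoms of SMILES "CCO").
# In the described method the atoms and bonds would come from RDKit.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]            # undirected bonds as atom-index pairs
atom_types = ["C", "H", "O", "N"]   # assumed, truncated atom vocabulary


def one_hot_features(atoms, atom_types):
    """One row per atom, with a single 1 in the column of its element."""
    return [[1 if atom == kind else 0 for kind in atom_types] for atom in atoms]


def adjacency_matrix(num_atoms, bonds):
    """Symmetric 0/1 matrix: entry (i, j) is 1 only if atoms i and j are bonded."""
    adj = [[0] * num_atoms for _ in range(num_atoms)]
    for i, j in bonds:
        adj[i][j] = 1
        adj[j][i] = 1
    return adj


features = one_hot_features(atoms, atom_types)
adjacency = adjacency_matrix(len(atoms), bonds)
```

These two matrices are the inputs that the embodiment passes to the GAT; the GAT and the max-pooling layer themselves are not sketched here.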
Embodiment 4: This embodiment differs from one of Embodiments 1 to 3 in that, in step 1.2, a Transformer module processes the drug molecules to obtain their chemical sequence semantic features; the specific process is as follows:
obtain a subsequence corpus from an existing source (the corpus provided with the MolTrans paper of Huang et al., 2021, which mines millions of drug molecular sequences from multiple unlabeled data sources and extracts the high-frequency drug subsequences from all molecular sequences, thereby building a subsequence corpus);
collect the SMILES sequence of each drug molecule;
decompose the SMILES sequence of each drug molecule into subsequences using the BPE algorithm and the corpus (the BPE algorithm splits the whole molecular sequence according to the subsequences that occur most frequently in the corpus, so that a reasonable segmentation into subsequences is obtained. Some drug molecules contain five to six hundred atoms, so the chemical sequence representing the drug is very long, and general deep learning methods capture the semantics of long sequences poorly. Moreover, a drug does not act in the human body as individual elements but in the form of functional groups, so decomposing the sequence into functional groups is of practical significance. A vocabulary of frequent functional-group subsequences is therefore used.) The sequence of each drug molecule in the dataset is segmented into its most frequent subsequences, $d_i = \{s_1, s_2, \ldots\}$,
where $d_i$ is the SMILES sequence of the i-th drug molecule
and $s_j$ is the j-th subsequence of the SMILES sequence $d_i$;
the subsequences are then fed into the Transformer module, which extracts the chemical sequence semantic features of the drug molecule (numerical vectors) through multi-head attention, residual connections, normalization, and a final linear layer; the specific process is as follows:
let
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
$h_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}), \quad i = 1, \ldots, m$
$a_1 = m a_2$
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, \ldots, h_m)W$
where Attention denotes the attention function; Q is the query matrix, K is the key (index) matrix, and V is the value matrix that is weighted by the attention weights; MultiHead is the matrix obtained by concatenating the m attention heads; Concat denotes concatenation of the multi-head attention results; W is a learnable parameter matrix; $h_m$ is the result learned by the m-th attention head; $a_1$ is the overall dimension and $a_2$ is the feature dimension set for each head; $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are parameter matrices; m is the number of attention heads; softmax maps the scaled inner product of Q and K to a probability distribution over [0, 1], which serves as the attention weights; T denotes the transpose; and $d_k$ is the vector dimension.
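The attention computation defined above can be sketched in plain Python. This is a minimal illustration of standard scaled dot-product attention and head concatenation, assuming self-attention (Q, K and V are all projections of the same input) and omitting the residual connections, normalization and final linear layer; the toy weights and the helper names `matmul`, `attention` and `multi_head` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Return (softmax(Q K^T / sqrt(d_k)) V, attention weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head(x, heads, w_out):
    """Self-attention with m heads; heads is a list of (Wq, Wk, Wv) triples."""
    outputs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))[0]
               for wq, wk, wv in heads]
    # Concat(h_1, ..., h_m): join the per-head rows token by token.
    concat = [[value for head in outputs for value in head[i]] for i in range(len(x))]
    return matmul(concat, w_out)


# Toy input: 2 tokens, a_1 = 2, m = 2 heads of size a_2 = 1 (so a_1 = m * a_2).
x = [[1.0, 0.0], [0.0, 1.0]]
heads = [([[1.0], [0.0]],) * 3, ([[0.0], [1.0]],) * 3]
w_out = [[1.0, 0.0], [0.0, 1.0]]
encoded = multi_head(x, heads, w_out)
```

Each row of the attention weights sums to 1, and the output keeps one row per input subsequence, which is what allows a later pooling or linear layer to produce a fixed-size sequence feature.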
Other steps and parameters are the same as in one of Embodiments 1 to 3.
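The decomposition of a SMILES sequence into frequent subsequences can be approximated with a short sketch. Note the assumption: this is a greedy longest-match segmentation against a fixed vocabulary, a simplification that stands in for the BPE algorithm and the MolTrans corpus; the vocabulary and the input string below are hypothetical.

```python
def segment(smiles, vocab):
    """Split a SMILES string into the longest vocabulary subsequences, left to right.

    Characters that match no vocabulary entry are emitted as single-character tokens.
    """
    max_len = max(len(token) for token in vocab)
    tokens, pos = [], 0
    while pos < len(smiles):
        for size in range(min(max_len, len(smiles) - pos), 0, -1):
            piece = smiles[pos:pos + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                pos += size
                break
    return tokens


# Hypothetical vocabulary of frequent subsequences (functional-group-like pieces).
vocab = {"C", "C(=O)", "O", "c1ccccc1"}
tokens = segment("CC(=O)Oc1ccccc1", vocab)
```

Concatenating the tokens reproduces the original sequence, so no structural information is lost; the segmentation only changes the unit on which the Transformer operates.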
Embodiment 5: This embodiment differs from one of Embodiments 1 to 4 in that, in step 1.3, the biomedical text features of the drugs and the biomedical text features of the side effects are obtained; the specific process is as follows:
collect the biomedical text of each drug (for example, "Alprostadil is a medication used to treat erectile dysfunction") from WIKI or PubChem;
collect the biomedical text of each side effect (for example, "Ascites is the abnormal build-up of fluid in the abdomen") from WIKI or PubChem;
to avoid possible data leakage, the collected biomedical text must not describe any interaction between a drug and a side effect. For example, a sentence such as "etoposide frequently causes nausea and vomiting" is not allowed in the collected biomedical text data;
input the biomedical text of the drugs and the biomedical text of the side effects separately into BioBERT, a pre-trained model for the biomedical domain, to extract the biomedical text features of the drugs and of the side effects; expressed as:
$X_d^{text} = \mathrm{BERT}(t_i^{d}) \in \mathbb{R}^{N \times f}, \quad X_s^{text} = \mathrm{BERT}(t_j^{s}) \in \mathbb{R}^{N \times f}$
where N is the number of drugs or side effects and f is the output dimension of the BioBERT pre-trained model; $\mathbb{R}$ denotes the real numbers; $X_d^{text}$ is the biomedical text feature of the drugs and $X_s^{text}$ is the biomedical text feature of the side effects; $t_i^{d}$ is the medical text of the i-th drug and $t_j^{s}$ is the medical text of the j-th side effect; BERT is the BioBERT pre-trained model.
Other steps and parameters are the same as in one of Embodiments 1 to 4.
Embodiment 6: This embodiment differs from one of Embodiments 1 to 5 in that step 1.5 is expressed as follows:
the fine-grained fusion of the drug and the side effect is extracted by element-wise product operations, so as to capture the deep and comprehensive relationship between the drug and the side effect.
$P_1 = \sigma\left(W \cdot \mathrm{sum}\left(X_d^{graph} \odot X_s^{text},\; X_d^{seq} \odot X_s^{text},\; X_d^{text} \odot X_s^{text}\right)\right)$
where σ is the activation function, W is a learnable parameter matrix, $X_d^{graph}$ is the chemical structure semantic feature of the drug molecule, $X_d^{seq}$ is the chemical sequence semantic feature of the drug molecule, $X_d^{text}$ is the biomedical text feature of the drug, $\odot$ denotes the element-wise product, which is applied between $X_s^{text}$ and each of $X_d^{graph}$, $X_d^{seq}$ and $X_d^{text}$ in turn, $X_s^{text}$ is the biomedical text feature of the side effect, $P_1$ is the drug-side-effect pair learned by the first module, and sum is the vector addition operation.
Other steps and parameters are the same as in one of Embodiments 1 to 5.
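The fusion described in this embodiment (element-wise products of each drug representation with the side-effect text feature, summed and passed through a fully connected layer and an activation) can be sketched as follows. This is a minimal illustration under stated assumptions: ReLU is used as the activation, the batch normalization layer is omitted, and the toy vectors, weights and the function name `fuse` are hypothetical.

```python
def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]


def fuse(drug_views, side_effect, weight, bias):
    """Sum the element-wise products, then apply a dense layer and ReLU.

    drug_views: the three dimension-reduced drug vectors (graph, sequence, text).
    side_effect: the dimension-reduced side-effect text vector.
    """
    products = [hadamard(view, side_effect) for view in drug_views]
    summed = [sum(values) for values in zip(*products)]
    linear = [sum(w * x for w, x in zip(row, summed)) + b
              for row, b in zip(weight, bias)]
    return [max(0.0, value) for value in linear]   # ReLU is an assumption


# Toy 2-dimensional features (real features would share a larger dimension).
drug_views = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]
side_effect = [1.0, 0.5]
p1 = fuse(drug_views, side_effect, weight=[[1.0, -1.0], [-1.0, 0.0]], bias=[0.0, 0.0])
```

Because all four features are first reduced to the same dimension, the element-wise products are well defined and can be added directly.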
Embodiment 7: This embodiment differs from one of Embodiments 1 to 6 in that the drug-disease similarity matrix in step 2.1 is obtained as follows:
extract the associations between drugs and diseases from the Comparative Toxicogenomics Database (CTD) (the database contains 330397 associations between all the drugs and 6808 diseases), build a drug-disease association matrix from these associations, and compute the cosine similarity over the drug-disease association matrix to obtain the drug-disease similarity matrix;
for example, with 6808 diseases and 750 drugs, a 750 × 6808 matrix is constructed; the 330397 associations are filled in by row and column index to give the drug-disease association matrix, and computing the cosine similarity between its rows gives a 750 × 750 similarity matrix.
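The step from an association list to a similarity matrix can be sketched in a few lines. The example assumes a binary association matrix and a toy size of 3 drugs and 4 diseases in place of 750 × 6808; the function names and the data are illustrative only.

```python
import math


def association_matrix(pairs, n_drugs, n_diseases):
    """Binary matrix with a 1 at (drug, disease) for every known association."""
    matrix = [[0] * n_diseases for _ in range(n_drugs)]
    for drug, disease in pairs:
        matrix[drug][disease] = 1
    return matrix


def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between rows; all-zero rows get similarity 0."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    size = len(rows)
    sim = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if norms[i] and norms[j]:
                dot = sum(x * y for x, y in zip(rows[i], rows[j]))
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim


# Hypothetical toy data: 3 drugs, 4 diseases.
pairs = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 3)]
assoc = association_matrix(pairs, n_drugs=3, n_diseases=4)
drug_sim = cosine_similarity_matrix(assoc)   # 3 x 3 matrix
```

Drugs 0 and 1 share the same diseases and therefore have similarity 1, while drug 2 shares none with them and has similarity 0.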
The drug-drug similarity matrix in step 2.1 is obtained as follows:
query the drug-drug similarity scores from the STITCH database;
the similarity score of each pair of drugs lies in the range 0 to 1000, and is scaled proportionally to the range 0 to 1;
the scaled drug-drug similarity scores of all pairs constitute the drug-drug similarity matrix.
For example, if similarity scores are collected among m drugs, an m × m matrix is constructed and the scaled similarity scores are filled into it.
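A minimal sketch of the scaling and matrix filling follows. It assumes the scores are already available as a dictionary keyed by drug-index pairs, that missing pairs default to 0, and that each drug has similarity 1 with itself; none of these details are stated in the text, and no STITCH query is performed here.

```python
def scaled_similarity_matrix(scores, n_drugs, max_score=1000):
    """Fill a symmetric n x n matrix with scores rescaled from [0, max_score] to [0, 1]."""
    matrix = [[0.0] * n_drugs for _ in range(n_drugs)]
    for i in range(n_drugs):
        matrix[i][i] = 1.0   # assumption: a drug is maximally similar to itself
    for (i, j), score in scores.items():
        matrix[i][j] = matrix[j][i] = score / max_score
    return matrix


# Hypothetical scores between 3 drugs.
scores = {(0, 1): 900, (1, 2): 250}
drug_drug_sim = scaled_similarity_matrix(scores, n_drugs=3)
```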
Other steps and parameters are the same as in one of the first to sixth embodiments.
Embodiment 8: This embodiment differs from one of Embodiments 1 to 7 in that the similarity matrix between side effects in step 2.1 is obtained as follows:
obtain the side-effect information from the ADReCS database (each side effect is a node of a tree, and the side-effect information is the position of that node);
the ADReCS database is organized as a four-level tree, and each adverse drug reaction (ADR) term is assigned a unique ID; for example, in ADReCS the ID of polycythemia is 14.12.01.002.
Construct a matrix based on the obtained side-effect information:
if two side effects have no common parent node, their similarity is 0;
if two side effects have a common parent node, their similarity is μ (μ = 0.5);
if the common parent node of two side effects lies one level higher, their similarity is μ² (μ² = 0.5 × 0.5 = 0.25);
repeat until the similarity between all pairs of side effects has been computed;
fill the similarities between all pairs of side effects into the matrix to obtain the similarity matrix between side effects;
for example, if n side effects are collected, an n × n matrix is constructed;
the similarity matrix between side effects is thus computed from the node positions of the side effects in the four-level tree.
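The tree-based similarity can be sketched by comparing ID prefixes. The sketch rests on assumptions that go beyond the text: the similarity is taken as μ raised to the number of levels between the side effects and their deepest common ancestor (so a shared top level only gives μ³), and identical IDs get similarity 1. Apart from 14.12.01.002, the IDs below are made up for illustration.

```python
def shared_levels(id_a, id_b):
    """Number of leading ID components that two ADReCS-style IDs have in common."""
    count = 0
    for part_a, part_b in zip(id_a.split("."), id_b.split(".")):
        if part_a != part_b:
            break
        count += 1
    return count


def tree_similarity(id_a, id_b, mu=0.5, depth=4):
    """0 without a common ancestor, mu per level up to the deepest common ancestor."""
    if id_a == id_b:
        return 1.0          # assumption: a side effect is fully similar to itself
    shared = shared_levels(id_a, id_b)
    if shared == 0:
        return 0.0
    return mu ** (depth - shared)


# Only the first ID comes from the text; the others are hypothetical.
ids = ["14.12.01.002", "14.12.01.003", "14.12.02.001", "08.01.01.001"]
side_effect_sim = [[tree_similarity(a, b) for b in ids] for a in ids]
```

With these IDs, the first two side effects share a parent (similarity 0.5), the first and third share a grandparent (0.25), and the first and fourth share nothing (0).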
The word-vector representation of the side effects in step 2.1 is obtained as follows:
obtain a dataset consisting of q side-effect terms;
input each side-effect term in the dataset into a trained GloVe model, which outputs a p-dimensional feature;
inputting all q side-effect terms into the trained GloVe model gives a q × p feature matrix;
compute the cosine similarity between the rows of the q × p feature matrix to obtain the word-vector representation of the side effects;
for example, if 1000 side-effect terms are input into the trained GloVe model and each term yields a 300-dimensional feature, a 1000 × 300 feature matrix is obtained, and the cosine similarity computation gives a 1000 × 1000 similarity matrix.
The drug-side-effect similarity matrix in step 2.1 is obtained as follows:
extract the associations between drugs and side effects from the training set, and build a drug-side-effect association matrix from these associations;
transpose the drug-side-effect association matrix and then compute the cosine similarity to obtain the drug-side-effect similarity matrix;
for example, with c drugs and d side effects, a c × d matrix is constructed and the associations are filled in by row and column index to give the drug-side-effect association matrix; transposing this matrix and computing the cosine similarity gives a d × d similarity matrix;
in this way, two drug representations and three side-effect representations are obtained.
Other steps and parameters are the same as in one of Embodiments 1 to 7.
Embodiment 9: This embodiment differs from one of Embodiments 1 to 8 in that, in step 2.2, a second drug-side-effect pair is obtained based on the drug-disease similarity matrix, the drug-drug similarity matrix, the similarity matrix between side effects, the word-vector representation of the side effects, and the drug-side-effect similarity matrix; the specific process is as follows:
Step 2.2.1:
each row of the drug-disease similarity matrix is a feature;
each row of the drug-drug similarity matrix is a feature;
each row of the similarity matrix between side effects is a feature;
each row of the word-vector representation of the side effects is a feature;
each row of the drug-side-effect similarity matrix is a feature;
compute the outer product of the features of the drug-disease similarity matrix with the features of each of the similarity matrix between side effects, the word-vector representation of the side effects, and the drug-side-effect similarity matrix, giving 3 matrices;
compute the outer product of the features of the drug-drug similarity matrix with the features of each of the similarity matrix between side effects, the word-vector representation of the side effects, and the drug-side-effect similarity matrix, giving 3 matrices;
these pairwise outer products together form a multi-channel matrix;
the 6 matrices are input into a two-dimensional convolutional neural network to learn a deep representation of the drug and the side effect;
the expression is:
$P^{out} = \mathrm{CNN}\left(\mathrm{prot}\left(D_i^{n}, S_j^{m}\right)\right)$
where $D_i^{n}$ is the i-th row of the n-th drug similarity matrix (the drug-disease similarity matrix or the drug-drug similarity matrix), $S_j^{m}$ is the j-th row of the m-th side-effect matrix (the similarity matrix between side effects, the word-vector representation of the side effects, or the drug-side-effect similarity matrix), prot is the vector outer product operation, $P^{out}$ is the resulting drug-side-effect pair, and CNN is the two-dimensional convolutional neural network;
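The construction of the six-channel input can be illustrated with a short sketch. It assumes that the two drug features and the three side-effect features have already been mapped to the same dimension, uses toy 2-dimensional vectors, and stops before the convolutional network; the function names are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors as a len(u) x len(v) matrix."""
    return [[a * b for b in v] for a in u]


def interaction_maps(drug_features, side_effect_features):
    """One outer-product matrix per (drug feature, side-effect feature) combination."""
    return [outer(d, s) for d in drug_features for s in side_effect_features]


# Toy features for one drug (2 views) and one side effect (3 views).
drug_features = [[1.0, 2.0], [0.5, 0.0]]
side_effect_features = [[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]]
channels = interaction_maps(drug_features, side_effect_features)   # 2 x 3 = 6 channels
```

Stacking the six matrices as channels gives an image-like tensor, which is why a two-dimensional convolutional network is a natural choice for the next step.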
Step 2.2.2:
each row of the drug-disease similarity matrix is a feature;
each row of the drug-drug similarity matrix is a feature;
each row of the similarity matrix between side effects is a feature;
each row of the word-vector representation of the side effects is a feature;
each row of the drug-side-effect similarity matrix is a feature;
compute the element-wise product of the features of the drug-disease similarity matrix with the features of each of the similarity matrix between side effects, the word-vector representation of the side effects, and the drug-side-effect similarity matrix, giving 3 vectors;
compute the element-wise product of the features of the drug-drug similarity matrix with the features of each of the similarity matrix between side effects, the word-vector representation of the side effects, and the drug-side-effect similarity matrix, giving 3 vectors;
sum the 6 vectors and input the result into a fully connected network to extract fine-grained fusion features;
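the expression is:
$P^{elem} = \mathrm{FC}\left(\mathrm{sum}\left(D_i^{n} \odot S_j^{m}\right)\right)$
where $D_i^{n}$ is the i-th row of the n-th drug similarity matrix (the drug-disease similarity matrix or the drug-drug similarity matrix), $S_j^{m}$ is the j-th row of the m-th side-effect matrix (the similarity matrix between side effects, the word-vector representation of the side effects, or the drug-side-effect similarity matrix), $\odot$ is the element-wise product, sum is vector addition over the 6 products, FC is the fully connected network, and $P^{elem}$ is the resulting drug-side-effect pair;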
Step 2.2.3:
the two drug-side-effect pairs obtained in steps 2.2.1 and 2.2.2 are concatenated and fed into a fully connected neural network:
$P_2 = W\left(P^{out} \,\|\, P^{elem}\right)$
where $P^{out}$ and $P^{elem}$ are the drug-side-effect pairs obtained in steps 2.2.1 and 2.2.2 respectively, $\|$ denotes the concatenation operation, W is a learnable parameter matrix, and $P_2$ is the second drug-side-effect pair;
for example, if each of the two features has dimension 1 × 200, the concatenated feature has dimension 1 × 400.
Other steps and parameters are the same as in one of Embodiments 1 to 8.
Embodiment 10: This embodiment differs from one of Embodiments 1 to 9 in that, in step three, the drug-side-effect pairs learned in step one and step two are concatenated and fed into a multi-layer perceptron, which predicts whether the drug is associated with the side effect and, if so, the frequency score of the drug-side-effect pair; the expression is:
$y = \mathrm{MLP}(P_1 \,\|\, P_2)$
where MLP is a multi-layer perceptron, $\|$ denotes concatenation, and y is the output, comprising the association score and the frequency score of the drug-side-effect pair.
Other steps and parameters are the same as in one of the first to ninth embodiments.
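The prediction step can be illustrated with a minimal forward pass. The layer sizes, the ReLU hidden layer, the sigmoid on the association output and the linear frequency output are all assumptions made for this sketch; the text only specifies that the two pairs are concatenated and passed to a multi-layer perceptron with two outputs.

```python
import math


def dense(x, weight, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def predict(p1, p2, hidden_w, hidden_b, out_w, out_b):
    """Return (association probability, frequency score) for one drug-side-effect pair."""
    x = p1 + p2                                   # concatenation P1 || P2
    hidden = [max(0.0, h) for h in dense(x, hidden_w, hidden_b)]   # ReLU (assumed)
    assoc_logit, frequency = dense(hidden, out_w, out_b)
    association = 1.0 / (1.0 + math.exp(-assoc_logit))             # sigmoid (assumed)
    return association, frequency


# Hypothetical toy representations and weights.
p1, p2 = [1.0, 0.0], [0.0, 2.0]
hidden_w = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
out_w = [[0.0, 0.0], [1.0, 1.0]]
association, frequency = predict(p1, p2, hidden_w, [0.0, 0.0], out_w, [0.0, 0.0])
```

Sharing the hidden layer between the two outputs is one way to realize the multi-task setting, in which the association and the frequency are learned jointly.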
Experimental performance is evaluated using AUROC (area under the ROC curve), AUPR (area under the precision-recall curve), RMSE (root mean square error) and MAE (mean absolute error) as evaluation metrics; RMSE and MAE are statistical measures of the error between the true and predicted values.
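For reference, the two error metrics can be computed as in the following sketch; the frequency values are made up and are not experimental results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted frequency scores.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]
error_rmse = rmse(y_true, y_pred)
error_mae = mae(y_true, y_pred)
```

RMSE penalizes the single large error more strongly than MAE, which is why the two metrics are usually reported together.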
Table one: verification method test results
The present invention is capable of various other embodiments, and those skilled in the art may make corresponding modifications and variations in light of the present invention without departing from its spirit and essence; all such modifications and variations fall within the scope of protection of the appended claims.