CN116504331A - A Frequency Score Prediction Method for Drug Side Effects Based on Multimodality and Multitask

Info

Publication number: CN116504331A
Application number: CN202310479801.XA
Authority: CN
Inventors: 李洋; 汪国华; 刘武勇
Original assignee: Northeast Forestry University
Current assignee: Northeast Forestry University
Priority date: 2023-04-28
Filing date: 2023-04-28
Publication date: 2023-07-28
Anticipated expiration: 2043-04-28
Also published as: CN116504331B

Abstract

A frequency score prediction method for drug side effects based on multi-modality and multi-task learning. The invention relates to a method for predicting the frequency scores of drug side effects using deep learning. Its purpose is to solve the problems that existing computational methods have low accuracy in determining whether a drug and a side effect are associated, and low accuracy in predicting the frequency score of a drug-side effect association. The process is: 1. obtain the chemical structure semantic features of the drug molecule, the chemical sequence semantic features of the drug molecule, the biomedical text features of the drug, and the biomedical text features of the side effects, and from them obtain a first drug side effect pair; 2. calculate the similarity information of the drugs and the similarity information of the side effects, and from them obtain a second drug side effect pair; 3. concatenate the learned drug side effect pairs and feed them into a multi-layer perceptron to predict whether the drug and the side effect are associated and, when they are, the frequency score of the association. The invention belongs to the technical field of frequency prediction between drugs and side effects.

Description

Frequency score prediction method for drug side effects based on multiple modes and multiple tasks

Technical Field

The invention belongs to the technical field of frequency prediction between drugs and side effects, and particularly relates to a method for predicting the frequency scores of drug side effects using deep learning.

Background

A side effect is an unexpected reaction to a drug in the human body. Most side effects are harmful; they can place a great burden on public health and may even endanger life. Side effects are also a main cause of drug development failure, and during drug development researchers conduct a large number of animal tests and clinical trials to determine the side effects of the drugs being produced. It is therefore important to identify the possible side effects of a drug and resolve safety-related problems at an early stage of development. Exploring accurate drug side effect profiles at the experimental stage is very time-consuming and expensive, so using existing feature information to predict the undiscovered side effects of new and existing drugs has become a critical issue.

In recent years, with the development of computational methods, many researchers have used them to predict side effects. This gives researchers a deeper understanding of the mechanisms behind drug-side effect interactions and is expected to guide the development of safer, more effective drugs. In addition, identifying drug-related side effects computationally improves the screening success rate in drug research and development and provides biological explanations for drug repositioning and drug pathology.

Disclosure of Invention

The invention aims to solve the problems that existing computational methods have low accuracy in determining the association between a drug and a side effect and low accuracy in predicting the frequency score of that association, and to this end provides a frequency score prediction method for drug side effects based on multi-modality and multi-task learning.

The frequency score prediction method for drug side effects based on multi-modality and multi-task learning comprises the following specific process:

Step 1: obtain the chemical structure semantic features of the drug molecules, the chemical sequence semantic features of the drug molecules, the biomedical text features of the drugs, and the biomedical text features of the side effects;

obtain a first drug side effect pair based on the chemical structure semantic features of the drug molecules, the chemical sequence semantic features of the drug molecules, the biomedical text features of the drugs, and the biomedical text features of the side effects;

Step 2:

Step 2.1: calculate the similarity information of the drugs and the similarity information of the side effects using Jaccard similarity and cosine similarity, and map both to the same dimension;

the similarity information of the drugs consists of a drug-disease similarity matrix and a drug-drug similarity matrix;

the similarity information of the side effects consists of a similarity matrix between side effects, a word vector representation of the side effects, and a drug-side effect similarity matrix;

Step 2.2: obtain a second drug side effect pair based on the drug-disease similarity matrix, the drug-drug similarity matrix, the similarity matrix between side effects, the word vector representation of the side effects, and the drug-side effect similarity matrix;

Step 3: concatenate the drug side effect pairs learned in Step 1 and Step 2 and feed them into a multi-layer perceptron to predict whether the drug is associated with the side effect and, when it is, the frequency score of the association.

The beneficial effects of the invention are as follows:

Most existing supervised models treat drug side effect prediction as a binary classification task, which oversimplifies the complexity of drug-related side effects. Studying the frequency of drug side effects can provide a deeper explanation than a binary label alone. This work predicts both whether a drug and a side effect are associated and the frequency score of the association, thereby realizing multi-task learning.

A method is provided for predicting the frequency scores of side effects by learning features across the different modalities of a drug together with similarity features between drugs and side effects. By using data from different modalities and a different deep learning method to learn the features of each, the model can implicitly establish complex relations among the modalities. In addition, the extraction and fusion of drug and side effect features with a biomedical pre-trained model is proposed. To improve the accuracy of the predicted scores, a similarity feature interaction module is also designed, which learns local, fine-grained features of drug side effect pairs from the similarity information between drugs and side effects. The multi-modal feature learning method and the similarity feature learning method have the potential to be extended to other tasks such as drug-target association prediction and drug-disease association prediction, and offer a new research angle for association prediction in bioinformatics.

Drawings

FIG. 1 is a schematic diagram of predicting the frequency of side effects of a drug based on fusing various external knowledge.

Detailed Description

First embodiment: the frequency score prediction method for drug side effects based on multi-modality and multi-task learning in this embodiment comprises the following specific process:

Step 2, a similarity feature interactive learning module:

in addition to extracting information from the rich biochemical semantic information of drugs and side effects, the similarity information between drugs and side effects can be learned, so that the deep relationship between drugs and side effects is obtained;

Step 3, fusing the two modules and predicting the frequency:

the drug side effect pairs learned in Step 1 and Step 2 are concatenated and fed into a multi-layer perceptron to predict whether the drug is associated with the side effect and, when it is, the frequency score of the association.

Second embodiment: this embodiment differs from the first embodiment in that, in Step 1, the chemical structure semantic features of the drug molecules, the chemical sequence semantic features of the drug molecules, the biomedical text features of the drugs, and the biomedical text features of the side effects are obtained as follows;

a multi-modal semantic representation learning module: the chemical structure semantics, chemical sequence semantics, and biomedical semantic information of a drug molecule can all represent the biological properties of the drug, so the corresponding representations are learned from these three modalities of the drug;

the specific process is as follows:

Step 1.1: a graph attention network (GAT) is selected to process the drug molecules and obtain their chemical structure semantic features (learning a molecular graph representation);

Step 1.2: a Transformer module is selected to process the drug molecules and obtain their chemical sequence semantic features;

Step 1.3: the biomedical text features of the drugs and the biomedical text features of the side effects are acquired;

Step 1.4: the chemical structure semantic features of the drug molecules extracted in Step 1.1, the chemical sequence semantic features of the drug molecules extracted in Step 1.2, and the biomedical text features of the drugs and of the side effects extracted in Step 1.3 are each reduced to the same dimension through a fully connected layer, giving the dimension-reduced chemical structure semantic features of the drug molecules, chemical sequence semantic features of the drug molecules, biomedical text features of the drugs, and biomedical text features of the side effects;

Step 1.5: to obtain a fine-grained fusion between drugs and side effects:

representation 1 is calculated as the element-wise product of the dimension-reduced chemical structure semantic features of the drug molecule and the dimension-reduced biomedical text features of the side effect;

representation 2 is calculated as the element-wise product of the dimension-reduced chemical sequence semantic features of the drug molecule and the dimension-reduced biomedical text features of the side effect;

representation 3 is calculated as the element-wise product of the dimension-reduced biomedical text features of the drug and the dimension-reduced biomedical text features of the side effect;

representations 1, 2, and 3 are added and fed into a fully connected layer, whose output passes through an activation function and then a batch normalization layer to give the drug side effect pair learned by the first module.

Other steps and parameters are the same as in the first embodiment.

Third embodiment: this embodiment differs from the first or second embodiment in that, in Step 1.1, a graph attention network (GAT) is selected to process the drug molecules and obtain their chemical structure semantic features (learning a molecular graph representation); the specific process is as follows:

collect the SMILES sequence of the drug molecule and convert it into an undirected molecular graph G with the RDKit tool;

the undirected molecular graph is G = (V, E),

where V is the set of atoms, for example V = {C, H, O, …, Sr}, and E is the set of chemical bonds between atoms;

construct the feature matrix of the drug molecule by encoding the distinct chemical properties of each atom as a one-hot vector;

construct the adjacency matrix of the drug molecule from its two-dimensional structure: each atom of the drug is a node; if a bond exists between two atoms, the entries of the adjacency matrix at the rows and columns of the two atom nodes are set to 1, and if no bond exists between them the entries are set to 0 (each drug has a unique structure in the two-dimensional plane, and each atom (C, H, O, etc.) in the structure learns its representation by aggregating its surrounding neighbors);

the feature matrix and the adjacency matrix of the drug molecule are input into the GAT, the output features of the GAT are input into a max pooling layer, and the max pooling layer outputs the chemical structure semantic features of the drug molecule (the learned molecular graph representation).
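The construction of the two GAT inputs can be sketched as follows. This is a minimal illustration, not the patented implementation: the atom list, bond list, and atom vocabulary are hypothetical toy values written by hand, whereas in practice they would be read from the molecular graph produced by RDKit.

```python
import numpy as np

# Hypothetical toy molecule: heavy atoms of ethanol, C-C-O (hydrogens omitted).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]            # undirected bonds as pairs of atom indices
atom_vocab = ["C", "H", "O", "N"]   # assumed atom-type vocabulary

def one_hot_features(atoms, vocab):
    """Feature matrix: one one-hot row per atom."""
    x = np.zeros((len(atoms), len(vocab)), dtype=np.float32)
    for i, symbol in enumerate(atoms):
        x[i, vocab.index(symbol)] = 1.0
    return x

def adjacency_matrix(n_atoms, bonds):
    """Symmetric adjacency matrix: 1 where a bond exists, 0 otherwise."""
    a = np.zeros((n_atoms, n_atoms), dtype=np.float32)
    for i, j in bonds:
        a[i, j] = 1.0
        a[j, i] = 1.0
    return a

X = one_hot_features(atoms, atom_vocab)
A = adjacency_matrix(len(atoms), bonds)
```

Both matrices would then be passed to a GAT layer followed by max pooling over the atom axis; those layers are left out here to keep the sketch dependency-free.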

Other steps and parameters are the same as in the first or second embodiment.

Fourth embodiment: this embodiment differs from one of the first to third embodiments in that, in Step 1.2, a Transformer module is selected to process the drug molecules and obtain their chemical sequence semantic features; the specific process is as follows:

obtain the subsequences in an existing corpus (the corpus provided with the MolTrans paper of Huang et al., 2021, is used; that work mines millions of drug molecular sequences from multiple unlabeled data sources and extracts the high-frequency drug subsequences from all molecular sequences to form a subsequence corpus);

collect the SMILES sequence of the drug molecule;

decompose the SMILES sequence of the drug molecule into subsequences using the byte pair encoding (BPE) algorithm and the corpus. After a molecular sequence passes through the BPE algorithm, the whole sequence is split according to the subsequences that occur most frequently in the corpus, which yields a reasonable segmentation into subsequences. Some drug molecules contain as many as five to six hundred atoms, so the chemical sequence representing the drug is very long, and general deep learning methods do not capture the semantics of long sequences well. Moreover, a drug acts in the human body not through single elements but in the form of functional groups, so decomposing the sequence into individual functional groups is of practical significance. Using the vocabulary of frequent functional-group subsequences, the sequence of each drug molecule in the dataset is segmented into its most frequent subsequences:

$$d_i = \{s_1, s_2, \dots, s_j, \dots\}$$

where $d_i$ is the SMILES sequence of the i-th drug molecule and $s_j$ is the j-th subsequence of $d_i$;
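The idea of splitting a SMILES string by a vocabulary of frequent subsequences can be sketched as below. Note that this is a greedy longest-match segmentation, a simplification chosen for brevity and not the BPE merge procedure itself; the vocabulary is a hypothetical three-entry example rather than the MolTrans corpus.

```python
def segment_smiles(smiles, vocab):
    """Greedy longest-match split of a SMILES string into vocabulary subsequences.

    Characters not covered by any vocabulary entry become single-character tokens.
    """
    max_len = max(len(v) for v in vocab)
    tokens, i = [], 0
    while i < len(smiles):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(smiles) - i), 0, -1):
            piece = smiles[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical vocabulary of frequent subsequences (acetoxy, benzene ring, carboxyl).
vocab = {"CC(=O)O", "c1ccccc1", "C(=O)O"}
aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = segment_smiles(aspirin, vocab)
print(tokens)
```

Each token would then be mapped to an integer index and embedded before entering the Transformer module.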

the subsequences are then fed into the Transformer module, and the chemical sequence semantic features of the drug molecule (numerical values) are extracted through multi-head attention, residual connections, regularization, and a final linear layer; the specific process is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

set

$$a_1 = m\,a_2$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, \dots, h_m)\,W$$

where Attention denotes the attention weighting; Q is the query matrix, K is the index (key) matrix, and V is the matrix that is weighted according to the attention weights; MultiHead is the matrix obtained by concatenating the m attention heads; Concat denotes concatenation of the results of the multi-head attention mechanism; W is a learnable parameter matrix; $h_m$ is the result learned by the m-th attention head; $a_1$ is the dimension size and $a_2$ is the set feature dimension; the projection matrices of the heads are parameter matrices; m is the number of attention heads; softmax maps the inner product of Q and K to a probability distribution over [0, 1], which gives the attention weights; T denotes the transpose; and d is the vector dimension.
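A minimal NumPy sketch of scaled dot-product attention and its multi-head form follows, assuming the standard Transformer formulation. The per-head projection matrices `w_q`, `w_k`, `w_v`, the output matrix `w_o`, and all sizes are illustrative names and values that do not come from the patent; weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V, also returning the attention weights."""
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v, weights

def multi_head(x, w_q, w_k, w_v, w_o):
    """Self-attention over x (tokens, a1) with m heads of width a2, a1 = m * a2."""
    heads = [attention(x @ wq, x @ wk, x @ wv)[0]
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    return np.concatenate(heads, axis=-1) @ w_o   # Concat(h_1..h_m) W

rng = np.random.default_rng(0)
m, a2 = 4, 8              # assumed number of heads and per-head dimension
a1 = m * a2               # model dimension
x = rng.normal(size=(5, a1))                      # 5 subsequence embeddings
w_q = [rng.normal(size=(a1, a2)) for _ in range(m)]
w_k = [rng.normal(size=(a1, a2)) for _ in range(m)]
w_v = [rng.normal(size=(a1, a2)) for _ in range(m)]
w_o = rng.normal(size=(m * a2, a1))
out = multi_head(x, w_q, w_k, w_v, w_o)
```

The residual connection, normalization, and final linear layer mentioned in the text would wrap this function in a full Transformer block.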

Other steps and parameters are the same as in one of the first to third embodiments.

Fifth embodiment: this embodiment differs from one of the first to fourth embodiments in that, in Step 1.3, the biomedical text features of the drugs and the biomedical text features of the side effects are acquired as follows:

collect the biomedical text of each drug (for example, "Alprostadil is a medication used to treat erectile dysfunction") from WIKI or PubChem;

collect the biomedical text of each side effect (for example, "Ascites is the abnormal build-up of fluid in the abdomen") from WIKI or PubChem;

to avoid possible data leakage, the collected biomedical text must contain no information about interactions between drugs and side effects. For example, a sentence such as "etoposide frequently causes nausea and vomiting" is not allowed in the collected biomedical text data;

the biomedical text of the drugs and the biomedical text of the side effects are each input into BioBERT, a pre-trained model for the biomedical domain, to extract the biomedical text features of the drugs and of the side effects, expressed as:

$$X^{d}_{\mathrm{text}} = \mathrm{BERT}(t^{d}_{i}) \in \mathbb{R}^{N \times f}, \qquad X^{s}_{\mathrm{text}} = \mathrm{BERT}(t^{s}_{j}) \in \mathbb{R}^{N \times f}$$

where N is the number of drugs or side effects and f is the output dimension of the BioBERT pre-trained model; $\mathbb{R}$ denotes the real numbers; $X^{d}_{\mathrm{text}}$ is the biomedical text feature of the drugs and $X^{s}_{\mathrm{text}}$ is the biomedical text feature of the side effects; $t^{d}_{i}$ is the medical text of the i-th drug and $t^{s}_{j}$ is the medical text of the j-th side effect; BERT is the BioBERT pre-trained model.

Other steps and parameters are the same as in one of the first to fourth embodiments.

Sixth embodiment: this embodiment differs from one of the first to fifth embodiments in that Step 1.5 is expressed as:

$$P^{1} = \mathrm{BN}\Big(\sigma\big(W \cdot \mathrm{sum}(X^{d}_{\mathrm{graph}} \odot X^{s}_{\mathrm{text}},\; X^{d}_{\mathrm{seq}} \odot X^{s}_{\mathrm{text}},\; X^{d}_{\mathrm{text}} \odot X^{s}_{\mathrm{text}})\big)\Big)$$

the fine-grained fusion of the drug and the side effect is extracted by the element-wise product operation to capture the deep and comprehensive relationship between the drug and the side effect.

Here σ is the activation function, W is a learnable parameter matrix, $X^{d}_{\mathrm{graph}}$ is the chemical structure semantic feature of the drug molecule, $X^{d}_{\mathrm{seq}}$ is the chemical sequence semantic feature of the drug molecule, $X^{d}_{\mathrm{text}}$ is the biomedical text feature of the drug, and $X^{s}_{\mathrm{text}}$ is the biomedical text feature of the side effect. The symbol ⊙ denotes the element-wise product, which is applied between each of $X^{d}_{\mathrm{graph}}$, $X^{d}_{\mathrm{seq}}$, $X^{d}_{\mathrm{text}}$ and $X^{s}_{\mathrm{text}}$ in turn; sum is the vector addition operation; BN is the batch normalization layer; and $P^{1}$ is the drug side effect pair learned by the first module.
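The fusion step can be sketched in NumPy as follows. This is an illustration under stated assumptions: ReLU stands in for the unspecified activation function, batch normalization is applied without learnable scale and shift, and all array names, sizes, and random inputs are hypothetical.

```python
import numpy as np

def fuse_pair(x_graph, x_seq, x_text, s_text, w, eps=1e-5):
    """Element-wise products with the side effect text feature, summed,
    then fully connected layer -> activation -> batch normalization."""
    summed = x_graph * s_text + x_seq * s_text + x_text * s_text
    hidden = np.maximum(summed @ w, 0.0)            # ReLU (assumed activation)
    mean = hidden.mean(axis=0, keepdims=True)       # statistics over the batch axis
    var = hidden.var(axis=0, keepdims=True)
    return (hidden - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch, dim, out_dim = 4, 8, 6       # assumed sizes after dimension reduction
x_graph, x_seq, x_text, s_text = (rng.normal(size=(batch, dim)) for _ in range(4))
w = rng.normal(size=(dim, out_dim))
p1 = fuse_pair(x_graph, x_seq, x_text, s_text, w)
```

Because all three drug features are multiplied by the same side effect feature, the sum equals `(x_graph + x_seq + x_text) * s_text`, which is a cheaper way to compute the same quantity.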

Other steps and parameters are the same as in one of the first to fifth embodiments.

Seventh embodiment: this embodiment differs from one of the first to sixth embodiments in that the drug-disease similarity matrix in Step 2.1 is obtained as follows:

extract the associations between drugs and diseases from the Comparative Toxicogenomics Database (CTD) (the database contains 330,397 associations between the drugs and 6,808 diseases), build a drug-disease association matrix from these associations, and calculate cosine similarity on the drug-disease association matrix to obtain the drug-disease similarity matrix;

for example, with 6,808 diseases and 750 drugs in total, a 750 × 6808 matrix can be constructed; filling in the 330,397 associations by row and column index gives the drug-disease association matrix, and calculating cosine similarity between its rows gives a 750 × 750 similarity matrix.
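The row-wise cosine similarity described in this example can be sketched as follows, using a small hypothetical 3 × 4 association matrix in place of the 750 × 6808 one.

```python
import numpy as np

def cosine_similarity_rows(m):
    """Cosine similarity between every pair of rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for all-zero rows
    unit = m / norms
    return unit @ unit.T

# Hypothetical association matrix: 3 drugs (rows) x 4 diseases (columns).
assoc = np.array([[1, 1, 0, 0],
                  [1, 1, 0, 0],
                  [0, 0, 1, 0]], dtype=float)
sim = cosine_similarity_rows(assoc)   # 3 x 3 drug similarity matrix
```

Drugs 0 and 1 share the same disease profile and so receive similarity 1, while drug 2 shares no disease with them and receives similarity 0.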

The drug-drug similarity matrix in Step 2.1 is obtained as follows:

query the drug-drug similarity scores from the STITCH database;

the similarity score of each pair of drugs lies between 0 and 1000 and is scaled proportionally to the range 0 to 1;

the drug-drug similarity scores of all pairs form the drug-drug similarity matrix.

For example, given the similarity scores among m drugs, an m × m matrix can be constructed and filled with the scaled similarity scores.
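A minimal sketch of the scaling and matrix filling, assuming the scores are available as a dictionary keyed by drug pairs. The drug names and scores are hypothetical, and leaving unlisted pairs (including the diagonal) at 0 is an assumption the text does not address.

```python
import numpy as np

def build_drug_similarity(drugs, scores):
    """Fill a symmetric m x m matrix with scores rescaled from 0-1000 to 0-1."""
    index = {name: i for i, name in enumerate(drugs)}
    sim = np.zeros((len(drugs), len(drugs)))
    for (a, b), score in scores.items():
        value = score / 1000.0
        sim[index[a], index[b]] = value
        sim[index[b], index[a]] = value
    return sim

drugs = ["drug_a", "drug_b", "drug_c"]                       # hypothetical names
scores = {("drug_a", "drug_b"): 900, ("drug_b", "drug_c"): 250}
drug_sim = build_drug_similarity(drugs, scores)
```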

Other steps and parameters are the same as in one of the first to sixth embodiments.

Eighth embodiment: this embodiment differs from one of the first to seventh embodiments in that the similarity matrix between side effects in Step 2.1 is obtained as follows:

obtain side effect information from the ADReCS database (each side effect is a node on a tree, and the side effect information is the node position);

the ADReCS database is organized as a four-level tree dataset in which each adverse drug reaction (ADR) term is assigned a unique ID; for example, in the ADReCS dataset the ID of polycythemia is 14.12.01.002.

Construct a matrix based on the obtained side effect information:

if two side effects have no common parent node, their similarity is 0;

if two side effects have a common parent node, their similarity is μ (μ = 0.5);

if the common parent node of two side effects is at a higher level of the tree, their similarity is μ² (μ² = 0.5 × 0.5 = 0.25);

repeat until the similarity between all pairs of side effects has been calculated;

fill the similarities between all pairs of side effects into the matrix to obtain the similarity matrix between side effects;

for example, if n kinds of side effect information are collected, an n × n matrix can be constructed;

the similarity matrix between side effects is thus calculated from the node positions of the side effects in the four-level tree data.
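One possible reading of this tree rule is sketched below. It assumes that every ID has the same four dot-separated levels, that the similarity is μ raised to the number of levels between a term and the deepest shared ancestor, and that a term has similarity 1 with itself; the extension to μ³ for terms sharing only the top level is an extrapolation, not something the text states. All IDs other than 14.12.01.002 are hypothetical.

```python
def adr_similarity(id_a, id_b, mu=0.5):
    """Tree-based similarity between two ADReCS-style IDs such as '14.12.01.002'."""
    if id_a == id_b:
        return 1.0                       # assumed self-similarity
    a, b = id_a.split("."), id_b.split(".")
    shared = 0
    for x, y in zip(a, b):               # length of the common ID prefix
        if x != y:
            break
        shared += 1
    if shared == 0:
        return 0.0                       # no common parent node
    return mu ** (len(a) - shared)       # direct parent -> mu, one level up -> mu^2

ids = ["14.12.01.002", "14.12.01.003", "14.12.02.001", "10.01.01.001"]
adr_sim = [[adr_similarity(p, q) for q in ids] for p in ids]
```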

The word vector representation of the side effects in Step 2.1 is obtained as follows:

obtain a dataset consisting of q side effect words;

input each side effect word in the dataset into a trained GloVe model, which outputs a p-dimensional feature;

inputting all q side effect words into the trained GloVe model gives a q × p feature matrix;

cosine similarity is calculated between the rows of the q × p feature matrix to obtain the word vector representation of the side effects;

for example, 1000 side effect words are input into a trained GloVe model, each side effect yields a 300-dimensional feature, a 1000 × 300 feature matrix is obtained, and cosine similarity gives a 1000 × 1000 similarity matrix.

The drug-side effect similarity matrix in Step 2.1 is obtained as follows:

extract the associations between drugs and side effects from the training set and build a drug-side effect association matrix from them;

transpose the drug-side effect association matrix and then calculate cosine similarity to obtain the drug-side effect similarity matrix;

for example, with c drugs and d side effects, a c × d matrix can be constructed; filling in the associations by row and column index gives the drug-side effect association matrix, and calculating cosine similarity after transposing the matrix gives a d × d similarity matrix;

in this way, two drug representations and three side effect representations are obtained;

Other steps and parameters are the same as in one of the first to seventh embodiments.

Ninth embodiment: this embodiment differs from one of the first to eighth embodiments in that, in Step 2.2, the second drug side effect pair is obtained based on the drug-disease similarity matrix, the drug-drug similarity matrix, the similarity matrix between side effects, the word vector representation of the side effects, and the drug-side effect similarity matrix; the specific process is as follows:

Step 2.2.1:

each row of the drug-disease similarity matrix is a feature;

each row of the drug-drug similarity matrix is a feature;

each row of the similarity matrix between side effects is a feature;

each row of the word vector representation of the side effects is a feature;

each row of the drug-side effect similarity matrix is a feature;

the outer product of each feature of the drug-disease similarity matrix is taken with the corresponding feature of the similarity matrix between side effects, of the word vector representation of the side effects, and of the drug-side effect similarity matrix, giving 3 matrices;

the outer product of each feature of the drug-drug similarity matrix is taken with the corresponding feature of the similarity matrix between side effects, of the word vector representation of the side effects, and of the drug-side effect similarity matrix, giving 3 matrices;

these pairwise outer products together form a multi-channel matrix;

the 6 matrices are input into a two-dimensional convolutional neural network to learn a deep representation of drugs and side effects;

the expression is:

$$P_{\mathrm{CNN}} = \mathrm{CNN}\big(\mathrm{prot}(d^{n}_{i},\, s^{m}_{j})\big)$$

where $d^{n}_{i}$ is the i-th row of the n-th drug similarity matrix (the drug-disease similarity matrix or the drug-drug similarity matrix), $s^{m}_{j}$ is the j-th row of the m-th side effect similarity matrix (the similarity matrix between side effects, the word vector representation of the side effects, or the drug-side effect similarity matrix), prot is the vector outer product operation, $P_{\mathrm{CNN}}$ is the resulting drug side effect pair, and CNN is a two-dimensional convolutional neural network;
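The construction of the six-channel input can be sketched in NumPy as follows. The common feature dimension `k` and the random feature vectors are assumptions for illustration; the convolutional network that would consume the stacked maps is omitted.

```python
import numpy as np

def interaction_maps(drug_feats, side_feats):
    """Stack the outer product of every (drug feature, side effect feature)
    pair into one multi-channel array of shape (channels, k, k)."""
    return np.stack([np.outer(d, s) for d in drug_feats for s in side_feats])

rng = np.random.default_rng(0)
k = 16                                               # assumed common dimension after mapping
drug_feats = [rng.random(k) for _ in range(2)]       # drug-disease row, drug-drug row
side_feats = [rng.random(k) for _ in range(3)]       # three side effect similarity rows
maps = interaction_maps(drug_feats, side_feats)      # 2 x 3 = 6 channels
```

Each channel is a k × k interaction map, so the stack can be treated like a six-channel image by a 2D convolutional layer.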

Step 2.2.2:

each row of the drug-disease similarity matrix is a feature;

each row of the drug-drug similarity matrix is a feature;

each row of the similarity matrix between side effects is a feature;

the element-wise product of each feature of the drug-disease similarity matrix is taken with the corresponding feature of the similarity matrix between side effects, of the word vector representation of the side effects, and of the drug-side effect similarity matrix, giving 3 vectors;

the element-wise product of each feature of the drug-drug similarity matrix is taken with the corresponding feature of the similarity matrix between side effects, of the word vector representation of the side effects, and of the drug-side effect similarity matrix, giving 3 vectors;

the 6 vectors are added and input into a fully connected network to extract fine-grained fusion features;

the expression is:

$$P_{\mathrm{FC}} = \mathrm{FC}\big(\mathrm{sum}(d^{n}_{i} \odot s^{m}_{j})\big)$$

where $d^{n}_{i}$ is the i-th row of the n-th drug similarity matrix (the drug-disease similarity matrix or the drug-drug similarity matrix), $s^{m}_{j}$ is the j-th row of the m-th side effect similarity matrix (the similarity matrix between side effects, the word vector representation of the side effects, or the drug-side effect similarity matrix), ⊙ is the element-wise product, sum is the vector addition over the 6 products, FC is the fully connected network, and $P_{\mathrm{FC}}$ is the resulting drug side effect pair;

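Step 2.2.3:

the two drug side effect pairs are concatenated and fed into a fully connected neural network:

$$P^{2} = W\,(P_{\mathrm{CNN}} \,\|\, P_{\mathrm{FC}})$$

where $P_{\mathrm{CNN}}$ and $P_{\mathrm{FC}}$ are the drug side effect pairs learned by the convolutional neural network and by the fully connected network respectively, || denotes the concatenation operation, W is a learnable parameter matrix, and $P^{2}$ is the second drug side effect pair;

for example, if one representation has dimension 1 × 200 and the other also has dimension 1 × 200, the concatenation has dimension 1 × 400.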

Other steps and parameters are the same as in one of the first to eighth embodiments.

Tenth embodiment: this embodiment differs from one of the first to ninth embodiments in that, in Step 3, the drug side effect pairs learned in Step 1 and Step 2 are concatenated and fed into a multi-layer perceptron to predict whether the drug is associated with the side effect and, when it is, the frequency score of the association; the expression is:

$$y = \mathrm{MLP}(P^{1} \,\|\, P^{2})$$

where MLP is a multi-layer perceptron, || denotes concatenation, and y is the output, comprising the association score and the frequency score of the drug side effect pair.
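A forward pass of such a two-output perceptron can be sketched as follows. The single hidden layer, the ReLU activation, the sigmoid on the association output, and the linear frequency output are all assumptions made for illustration; the weights are random rather than trained.

```python
import numpy as np

def mlp_predict(p1, p2, w1, b1, w2, b2):
    """Concatenate the two drug side effect pairs and predict both tasks."""
    x = np.concatenate([p1, p2], axis=-1)           # P1 || P2
    hidden = np.maximum(x @ w1 + b1, 0.0)           # hidden layer with ReLU (assumed)
    out = hidden @ w2 + b2                          # two output units
    association = 1.0 / (1.0 + np.exp(-out[:, 0]))  # probability of an association
    frequency = out[:, 1]                           # regression output: frequency score
    return association, frequency

rng = np.random.default_rng(0)
batch, dim, hidden_dim = 3, 4, 5                    # assumed sizes
p1 = rng.normal(size=(batch, dim))
p2 = rng.normal(size=(batch, dim))
w1, b1 = rng.normal(size=(2 * dim, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=(hidden_dim, 2)), np.zeros(2)
association, frequency = mlp_predict(p1, p2, w1, b1, w2, b2)
```

In training, the two outputs would typically be paired with a classification loss and a regression loss respectively, which is what makes the setup multi-task.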

Other steps and parameters are the same as in one of the first to ninth embodiments.

Experimental performance was evaluated using AUROC (area under the ROC curve), AUPR (area under the PR curve), RMSE (root mean square error), and MAE (mean absolute error) as statistical measures of the error between the true and predicted samples.
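The two regression measures can be computed as in the sketch below; the true and predicted frequency scores are hypothetical values chosen only to illustrate the arithmetic.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 3, 5, 2]   # hypothetical true frequency scores
y_pred = [2, 3, 3, 2]   # hypothetical predicted frequency scores
print(rmse(y_true, y_pred), mae(y_true, y_pred))
```

RMSE penalizes the single error of 2 more heavily than MAE does, which is why the two measures are usually reported together.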

Table one: verification method test results

The present invention is capable of other embodiments, and those skilled in the art may make various modifications and variations in light of the present invention without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A frequency score prediction method for drug side effects based on multi-modality and multi-task learning, characterized in that the specific process of the method is as follows:

Step 1: Obtain the semantic features of the chemical structure of the drug molecule, the semantic features of the chemical sequence of the drug molecule, the biomedical text features of the drug, and the biomedical text features of the side effects.

The first drug side effect pair is obtained based on the semantic features of the chemical structure of the drug molecule, the semantic features of the chemical sequence of the drug molecule, the biomedical text features of the drug, and the biomedical text features of the side effects.

Step 2:

Step 2.1: Calculate the similarity information of drugs and the similarity information of side effects using Jaccard similarity and cosine similarity, and map the similarity information of drugs and the similarity information of side effects to the same dimension.

The drug similarity information is a drug-disease similarity matrix and a drug-drug similarity matrix;

The similarity information for side effects includes a similarity matrix between side effects, word vector representations between side effects, and a similarity matrix between drugs and side effects.

Step 2.2: Based on the drug-disease similarity matrix, the drug-drug similarity matrix, the similarity matrix between side effects, the word vector representation of side effects, and the drug-side effect similarity matrix, the second drug side effect pair is obtained.

Step 3: The drug side effect pairs learned in Step 1 and Step 2 are concatenated and fed into a multi-layer perceptron, which predicts whether the drug and the side effect are associated and, if they are, the frequency score of the association.

2. The frequency score prediction method for drug side effects based on multimodal and multi-task methods according to claim 1, characterized in that: in step one, the chemical structure semantic features of drug molecules, the chemical sequence semantic features of drug molecules, the biomedical text features of drugs, and the biomedical text features of side effects are obtained.

The specific process is as follows:

Step 11: Select the Graph Attention Neural Network (GAT) to process the drug molecule and obtain the chemical structure semantic features of the drug molecule.

Step 12: Select the Transformer module to process the drug molecule and obtain the chemical sequence semantic features of the drug molecule.

Step 13: Obtain the biomedical textual features of the drug and the biomedical textual features of its side effects;

Step 14: The chemical structure semantic features of the drug molecule extracted in Step 11, the chemical sequence semantic features of the drug molecule extracted in Step 12, and the biomedical text features of the drug and of the side effects extracted in Step 13 are reduced to the same dimension through a fully connected layer, giving the dimension-reduced chemical structure semantic features, chemical sequence semantic features, drug biomedical text features, and side effect biomedical text features.

Step 15:

The dimension-reduced chemical structure semantic features of the drug molecule and the dimension-reduced biomedical text features of the side effects are combined by an element-wise product to obtain representation 1;

The dimension-reduced chemical sequence semantic features of the drug molecule and the dimension-reduced biomedical text features of the side effects are combined by an element-wise product to obtain representation 2;

The dimension-reduced biomedical text features of the drug and the dimension-reduced biomedical text features of the side effects are combined by an element-wise product to obtain representation 3;

Representations 1, 2, and 3 are summed and fed into a fully connected layer. The output features of the fully connected layer are then passed through an activation function and a batch normalization layer to obtain the drug-side effect pair learned by the first module.
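The fusion in Step 15 can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the patented implementation: the inputs are random stand-ins for the dimension-reduced features, the sizes are toy values, ReLU is assumed as the activation function, and the batch normalization omits the learnable scale and shift.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, dim, out_dim = 4, 8, 6  # hypothetical sizes

# Dimension-reduced inputs, one row per drug-side effect pair (random stand-ins).
struct_feat = rng.normal(size=(batch, dim))  # chemical structure semantic features
seq_feat = rng.normal(size=(batch, dim))     # chemical sequence semantic features
drug_text = rng.normal(size=(batch, dim))    # biomedical text features of the drug
effect_text = rng.normal(size=(batch, dim))  # biomedical text features of the side effect

# Representations 1-3: element-wise product of each drug view with the side effect text.
rep1 = struct_feat * effect_text
rep2 = seq_feat * effect_text
rep3 = drug_text * effect_text
summed = rep1 + rep2 + rep3

# Fully connected layer, activation (ReLU assumed), then batch normalization
# (statistics over the batch, no learnable scale or shift in this sketch).
W = rng.normal(size=(dim, out_dim))
b = np.zeros(out_dim)
hidden = np.maximum(summed @ W + b, 0.0)
eps = 1e-5
pair_repr = (hidden - hidden.mean(axis=0)) / np.sqrt(hidden.var(axis=0) + eps)
```

Because all three products share the side effect text features, the sum keeps a single vector per pair regardless of how many drug views are combined.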

3. The frequency score prediction method for drug side effects based on multimodal and multi-task approaches according to claim 2, characterized in that: in Step 11, a graph attention neural network (GAT) is selected to process the drug molecule to obtain the chemical structure semantic features of the drug molecule; the specific process is as follows:

Collect the SMILES sequences of drug molecules and convert them into an undirected molecular graph G using the RDKit tool;

Undirected molecular graph G = (V, E);

Where V represents the set of atoms, and E represents the set of chemical bonds between atoms;

The feature matrix of the drug molecule is constructed from the one-hot vectors of its atoms;

The adjacency matrix of the drug molecule is constructed from the two-dimensional structure of the drug molecule. Each atom of the drug is represented as a node. If there is a bond between two atoms, the entries of the adjacency matrix at the rows and columns corresponding to the two atom nodes are set to 1; if there is no bond between the two atoms, those entries are set to 0.

The feature matrix and adjacency matrix of the drug molecule are input into the graph attention neural network (GAT). The output features of the GAT are input into the max pooling layer, and the max pooling layer outputs the chemical structure semantic features of the drug molecule.
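The construction of the feature matrix and adjacency matrix can be sketched as below. To keep the example self-contained, the molecular graph is written by hand rather than parsed from SMILES with RDKit, the atom vocabulary is an assumed three-element list, and the GAT is replaced by a single round of neighbour averaging, which is only a stand-in and does not compute attention.

```python
import numpy as np

# Hypothetical hand-written graph for ethanol (SMILES "CCO"), heavy atoms only.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
atom_types = ["C", "N", "O"]  # assumed one-hot vocabulary

# Feature matrix: one one-hot row per atom.
features = np.zeros((len(atoms), len(atom_types)))
for idx, symbol in enumerate(atoms):
    features[idx, atom_types.index(symbol)] = 1.0

# Adjacency matrix: 1 where a bond exists, 0 otherwise; symmetric because G is undirected.
adjacency = np.zeros((len(atoms), len(atoms)), dtype=int)
for i, j in bonds:
    adjacency[i, j] = 1
    adjacency[j, i] = 1

# Stand-in for the GAT: average each node with its neighbours (no attention weights).
degree_plus_self = adjacency.sum(axis=1, keepdims=True) + 1
node_out = (adjacency + np.eye(len(atoms))) @ features / degree_plus_self

# Max pooling over nodes gives one fixed-size vector per molecule.
molecule_vector = node_out.max(axis=0)
```

Max pooling makes the output size independent of the number of atoms, which is what allows molecules of different sizes to share the later fully connected layers.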

4. The frequency score prediction method for drug side effects based on multimodal and multi-task approaches according to claim 3, characterized in that: in Step 12, the Transformer module is selected to process the drug molecule to obtain the chemical sequence semantic features of the drug molecule; the specific process is as follows:

Obtain subsequences from an existing corpus;

Collect the SMILES sequences of drug molecules;

The SMILES sequence of each drug molecule is decomposed into subsequences using the BPE algorithm and the corpus,

where $d_i$ is the SMILES sequence of the i-th drug molecule;

$s_j$ is the j-th subsequence of the SMILES sequence $d_i$.
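The decomposition of a SMILES sequence into subsequences can be illustrated with a small byte-pair encoding (BPE) routine that replays merge rules in the order they were learned. The `bpe_segment` function and the merge table are hypothetical; in practice the merges would be learned from the corpus mentioned above.

```python
def bpe_segment(smiles: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a SMILES string into subsequences by applying BPE merges in learned order."""
    tokens = list(smiles)  # start from single characters
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Merge an adjacent (left, right) pair into one token when it matches the rule.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge table; a real one would be learned from a SMILES corpus.
merges = [("C", "C"), ("(", "="), ("(=", "O"), ("(=O", ")")]
subsequences = bpe_segment("CC(=O)O", merges)
```

With this merge table, acetic acid `CC(=O)O` is split into `CC`, `(=O)`, and `O`, and joining the subsequences always reproduces the original string.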

Next, the subsequence is fed into the Transformer module to extract the chemical sequence semantic features of the drug molecule; the specific process is as follows:

Set

$$a_1 = m\,a_2$$

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(h_1,\ldots,h_m)\,W$$

Where Attention denotes the attention computation; Q is the query matrix; K is the key matrix; V is the value matrix, which is weighted according to the attention weights; MultiHead is the matrix obtained by combining the m attention heads; Concat concatenates the results of the attention heads; W is the learnable parameter matrix; $h_m$ is the learning result of the m-th attention head; $a_1$ is the dimension size; $a_2$ is the set feature dimension; m is the number of attention heads; softmax maps the inner product of Q and K to a probability distribution over [0,1], which gives the attention weights; T denotes the transpose; and $d_k$ is the vector dimension.
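The multi-head attention used by the Transformer module can be sketched in NumPy. The per-head projection matrices `Wq`, `Wk`, and `Wv` follow the standard Transformer formulation and are an assumption here, since the source does not show them; all sizes and inputs are toy values.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = np.exp(x - x.max(axis=-1, keepdims=True))
    return shifted / shifted.sum(axis=-1, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the attention weights."""
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights


def multi_head(X, Wq, Wk, Wv, W):
    """Run m heads on per-head projections of X, concatenate them, and project with W."""
    heads = [attention(X @ q, X @ k, X @ v)[0] for q, k, v in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ W


rng = np.random.default_rng(0)
m, a2 = 4, 8   # number of heads and per-head feature dimension (toy values)
a1 = m * a2    # model dimension, a1 = m * a2
X = rng.normal(size=(5, a1))  # embeddings of 5 subsequences (random stand-ins)
Wq, Wk, Wv = (rng.normal(size=(m, a1, a2)) for _ in range(3))
W = rng.normal(size=(a1, a1))
out = multi_head(X, Wq, Wk, Wv, W)
_, weights = attention(X @ Wq[0], X @ Wk[0], X @ Wv[0])
```

Each row of `weights` sums to 1, and concatenating m heads of width `a2` restores the model dimension `a1` before the final projection.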

5. The method for predicting the frequency score of drug side effects based on multimodal and multi-task approaches according to claim 4, characterized in that: in Step 13, the biomedical text features of the drug and the biomedical text features of the side effects are obtained; the specific process is as follows:

Collect biomedical texts of drugs from WIKI or PubChem;

Collect biomedical texts of side effect sets from WIKI or PubChem;

The biomedical text of the drugs and the biomedical text of the side effects are each input into the BioBERT pre-trained model to extract the biomedical text features of the drugs and of the side effects; represented as:

Where N is the number of drugs or side effects, f is the output dimension of the BioBERT pre-trained model, and R denotes the real numbers. The two outputs are the biomedical text features of the drugs and the biomedical text features of the side effects, each a real-valued matrix of size N×f; the inputs are the medical text of each drug and the medical text of each side effect; BERT denotes the pre-trained BioBERT model.

6. The method for predicting the frequency score of drug side effects based on multimodal and multi-task approaches according to claim 5, characterized in that: the expression for Step 15 is:

Where σ is the activation function and W is the learnable parameter matrix; the three drug-side inputs are the chemical structure semantic features of the drug molecule, the chemical sequence semantic features of the drug molecule, and the biomedical text features of the drug; ⊙ denotes element-wise multiplication of each of these three inputs with the biomedical text features of the side effects; $P^1$ is the drug-side effect pair learned by the first module; and sum denotes vector addition.
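One possible written form of the Step 15 expression, assembled from the definitions above, is given below. The symbols $x_g$, $x_s$, $t_d$, and $t_e$ (structure features, sequence features, drug text features, and side effect text features) are placeholders chosen here and do not come from the source, and the placement of batch normalization (BN) after the activation follows the description of Step 15.

```latex
P^{1} = \mathrm{BN}\Big(\sigma\big(W \cdot \mathrm{sum}(x_g \odot t_e,\; x_s \odot t_e,\; t_d \odot t_e)\big)\Big)
```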

7. The method for predicting the frequency score of drug side effects based on multimodal and multi-task approaches according to claim 6, characterized in that: the process of obtaining the drug-disease similarity matrix in Step 21 is as follows:

Drug-disease associations are extracted from the Comparative Toxicogenomics Database (CTD). A drug-disease association matrix is built from these associations, and cosine similarity is computed on the drug-disease association matrix to obtain the drug-disease similarity matrix.
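The cosine similarity computation on an association matrix can be sketched as follows. The association matrix is hypothetical toy data, and the handling of all-zero rows (similarity 0) is an assumption made for the sketch.

```python
import numpy as np


def cosine_similarity_matrix(A: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of A; all-zero rows get similarity 0."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    unit = np.divide(A, norms, out=np.zeros_like(A, dtype=float), where=norms > 0)
    return unit @ unit.T


# Hypothetical drug-disease association matrix: rows are drugs, columns are diseases.
assoc = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
], dtype=float)
drug_sim = cosine_similarity_matrix(assoc)
```

In this sketch the result has one row and one column per drug: the first two drugs share the same disease profile and score 1, while the third shares no disease with them and scores 0.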

The process of obtaining the drug-drug similarity matrix in Step 21 is as follows:

Query drug-drug similarity scores using the STITCH database;

Each drug-drug similarity score ranges from 0 to 1000; the scores are then scaled proportionally to the range 0 to 1.
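The proportional scaling amounts to dividing each score by 1000. A minimal sketch, with hypothetical drug pairs and scores:

```python
def scale_scores(pairs: dict, max_score: float = 1000.0) -> dict:
    """Scale scores from [0, max_score] to [0, 1] by a fixed ratio."""
    return {pair: score / max_score for pair, score in pairs.items()}


# Hypothetical drug-pair scores on the 0-1000 scale.
raw = {("drug_a", "drug_b"): 900, ("drug_a", "drug_c"): 150}
scaled = scale_scores(raw)
```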

8. The method for predicting the frequency score of drug side effects based on multimodal and multi-task approaches according to claim 7, characterized in that: the process of obtaining the similarity matrix between side effects in Step 21 is as follows:

Retrieve side effect information from the ADReCS database;

A matrix is constructed based on the acquired side effect information;

If two side effects have no common parent node, the similarity between them is 0.

If two side effects share a direct parent node, the similarity between them is μ.

If the common parent node of two side effects is one level higher, the similarity between them is $\mu^2$.

This process is repeated until the similarity between all side effects has been calculated.

Fill the matrix with the similarity scores of all side effects to obtain the similarity matrix between side effects;
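The hierarchy-based similarity can be sketched under one reading of the rules above: the similarity is μ raised to the number of levels between the terms and their nearest common ancestor, and 0 when no ancestor is shared. The dotted identifiers, the value μ = 0.5, and the self-similarity of 1 are assumptions made for illustration and are not specified by the source.

```python
def hierarchy_similarity(id_a: str, id_b: str, mu: float = 0.5) -> float:
    """Similarity mu**k, where k is how many levels up the nearest shared ancestor sits."""
    a, b = id_a.split("."), id_b.split(".")
    if a == b:
        return 1.0  # assumed self-similarity
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    if shared == 0:
        return 0.0  # no common parent node at any level
    levels_up = max(len(a), len(b)) - shared
    return mu ** levels_up


# Hypothetical dotted hierarchy IDs, one segment per level.
ids = ["01.02.001", "01.02.002", "01.03.001", "02.01.001"]
matrix = [[hierarchy_similarity(r, c) for c in ids] for r in ids]
```

Under these assumptions, two terms with the same direct parent score μ, terms that only meet one level higher score μ², and terms in different top-level branches score 0.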

The process of obtaining the word vector representation of side effects in step two-one is as follows:

Obtain a dataset consisting of q side effect words;

Input each side effect word in the dataset into the trained GloVe model to output p-dimensional features;

A total of q side effect words are input into the trained GloVe model to obtain a p×q feature matrix;

Cosine similarity is calculated on the p×q feature matrix to obtain word vector representations between side effects;

The process of obtaining the similarity matrix between drugs and side effects in Step 21 is as follows:

Extract the associations between drugs and side effects from the training set, and obtain a drug-side effect association matrix based on these associations;

The drug-side effect association matrix is transposed, and cosine similarity is then computed on it to obtain the similarity matrix between drugs and side effects.

9. The frequency score prediction method for drug side effects based on multimodal and multi-task approaches according to claim 8, characterized in that: in Step 22, a second drug-side effect pair is obtained based on the drug-disease similarity matrix, the drug-drug similarity matrix, the side effect similarity matrix, the word vector representation of side effects, and the drug-side effect similarity matrix; the specific process is as follows:

Step 221:

Each row of the drug-disease similarity matrix represents a feature;

Each row of the drug-drug similarity matrix represents a feature;

Each row of the similarity matrix between side effects represents a feature;

Each row of the word vector representation of side effects represents a feature;

Each row of the similarity matrix between drugs and side effects represents a feature;

An outer product is computed between each feature of the drug-disease similarity matrix and each feature of the side effect similarity matrix, the word vector representation of side effects, and the drug-side effect similarity matrix, respectively, to obtain three matrices;

An outer product is computed between each feature of the drug-drug similarity matrix and each feature of the side effect similarity matrix, the word vector representation of side effects, and the drug-side effect similarity matrix, respectively, to obtain three more matrices;

The six matrices are input into a two-dimensional convolutional neural network to learn a deep representation of the drug-side effect pair.

The expression is:

Where the drug-side operand is the i-th row of the n-th drug similarity source (the drug-disease similarity matrix or the drug-drug similarity matrix); the side effect operand is the j-th row of the m-th side effect similarity source (the similarity matrix between side effects, the word vector representation of side effects, or the similarity matrix between drugs and side effects); Prot is the vector outer product operation; the result is the drug-side effect pair representation; and CNN is the two-dimensional convolutional neural network.
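The construction of the six outer-product matrices in Step 221 can be sketched in NumPy. The feature rows are random stand-ins with a toy shared dimension, and the convolution is a single untrained 3×3 filter written by hand, standing in for the two-dimensional CNN, whose architecture the source does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16  # shared dimension after mapping (toy value)

# One row per similarity source for a single drug-side effect pair (random stand-ins).
drug_feats = rng.random((2, dim))    # drug-disease row, drug-drug row
effect_feats = rng.random((3, dim))  # side effect similarity, word vector, drug-side effect rows

# 2 x 3 = 6 outer products, stacked as channels: shape (6, dim, dim).
channels = np.stack([np.outer(d, e) for d in drug_feats for e in effect_feats])


def conv2d_valid(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Multi-channel 'valid' cross-correlation producing one output map."""
    _, kh, kw = kernel.shape
    h, w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * kernel)
    return out


kernel = rng.normal(size=(6, 3, 3))  # one untrained 3x3 filter over the 6 channels
feature_map = conv2d_valid(channels, kernel)
```

Stacking the outer products as channels lets a single convolution see every pairwise interaction between a drug feature and a side effect feature at once.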

Step 222:

Each row of the drug-disease similarity matrix represents a feature;

Each row of the drug-drug similarity matrix represents a feature;

Each row of the similarity matrix between side effects represents a feature;

Each row of the word vector representation of side effects represents a feature;

Element-wise multiplication is performed between each feature of the drug-disease similarity matrix and each feature of the side effect similarity matrix, the word vector representation of side effects, and the drug-side effect similarity matrix, resulting in three vectors.

Element-wise multiplication is performed on each feature of the drug-drug similarity matrix with each feature of the side-effect similarity matrix, the word vector representation of side effects, and the drug-side-effect similarity matrix, resulting in three vectors.

The six vectors are summed and input into a fully connected network to extract fine-grained fusion features;

The expression is:

Where the drug-side operand is the i-th row of the n-th drug similarity source (the drug-disease similarity matrix or the drug-drug similarity matrix); the side effect operand is the j-th row of the m-th side effect similarity source (the similarity matrix between side effects, the word vector representation of side effects, or the similarity matrix between drugs and side effects); ⊙ denotes element-wise multiplication; sum is vector addition; W is the learnable parameter matrix; σ is the activation function; and the result is the drug-side effect pair representation.

Step 223:

The two drug-side effect pair representations obtained in Step 221 and Step 222 are concatenated and fed into a fully connected neural network;

Where || denotes the concatenation operation, W is the learnable parameter matrix, and $P^2$ is the second drug-side effect pair.
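Steps 221 to 223 can be summarized in one possible notation, assembled from the definitions above. The symbols $d_i^{\,n}$ and $e_j^{\,m}$ (drug and side effect feature rows), the intermediate names $P_{\mathrm{cnn}}$ and $P_{\mathrm{fc}}$, and the subscripts on $W_1$ and $W_2$ are placeholders introduced here to keep the two parameter matrices apart; they do not come from the source.

```latex
\begin{aligned}
P_{\mathrm{cnn}} &= \mathrm{CNN}\big(\{\mathrm{Prot}(d_i^{\,n}, e_j^{\,m})\}_{n=1,2;\; m=1,2,3}\big) \\
P_{\mathrm{fc}} &= \sigma\Big(W_{1} \cdot \mathrm{sum}\big(\{d_i^{\,n} \odot e_j^{\,m}\}_{n=1,2;\; m=1,2,3}\big)\Big) \\
P^{2} &= W_{2}\,\big(P_{\mathrm{cnn}} \,\|\, P_{\mathrm{fc}}\big)
\end{aligned}
```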

10. The method for predicting the frequency score of drug side effects based on multimodal and multi-task approaches according to claim 9, characterized in that: in step three, the drug-side effect pairs learned in Step 1 and Step 2 are concatenated and fed into a multilayer perceptron for prediction, which predicts whether the drug and the side effect are associated and, when an association exists, the frequency score of the drug-side effect pair; the expression is:

$$y = \mathrm{MLP}(P^1 \,\|\, P^2)$$

Where MLP denotes the multilayer perceptron, and y is the output: the association score and the frequency score of the drug-side effect pair.
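The multi-task prediction can be sketched as a forward pass with a shared hidden layer and two output heads. The two-head layout, the ReLU and sigmoid choices, the layer sizes, and the random weights are all assumptions made for illustration; the source states only that the MLP outputs an association score and a frequency score.

```python
import numpy as np

rng = np.random.default_rng(0)


def mlp_two_heads(p1, p2, params):
    """Concatenate both pair representations, apply a shared layer, then two output heads."""
    x = np.concatenate([p1, p2], axis=-1)                    # P1 || P2
    h = np.maximum(x @ params["W_h"] + params["b_h"], 0.0)   # shared ReLU layer (assumed)
    logit = h @ params["w_assoc"] + params["b_assoc"]
    association = 1.0 / (1.0 + np.exp(-logit))               # association score in [0, 1]
    frequency = h @ params["w_freq"] + params["b_freq"]      # unbounded frequency score
    return association, frequency


d1, d2, hidden = 6, 6, 16  # toy dimensions
params = {
    "W_h": rng.normal(size=(d1 + d2, hidden)), "b_h": np.zeros(hidden),
    "w_assoc": rng.normal(size=hidden), "b_assoc": 0.0,
    "w_freq": rng.normal(size=hidden), "b_freq": 0.0,
}
p1 = rng.normal(size=(3, d1))  # first-module representations for 3 pairs (stand-ins)
p2 = rng.normal(size=(3, d2))  # second-module representations (stand-ins)
association, frequency = mlp_two_heads(p1, p2, params)
```

Sharing the hidden layer between the two heads is one common way to let the association task and the frequency task inform each other; a trained model would also need a loss for each head, which this sketch omits.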