CN115762646B - A pan-cancer common driver pathway identification method based on GAN sample balance - Google Patents
- Publication number
- CN115762646B (application CN202211581374.8A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- gene
- cancer
- chromosome
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a pan-cancer common driver pathway identification method based on GAN sample balancing, comprising the following steps: 1) generating somatic mutation data that correspond to each cancer and conform to the real data distribution; 2) constructing a model, CDP-HA, that minimizes the dispersion among the total weights of the individual cancers; and 3) introducing a parthenogenetic (single-parent genetic) algorithm to solve the CDP-HA model. The method is a useful tool for identifying pan-cancer common driver pathways and is highly extensible and practical.
Description
Technical Field
The invention relates to the field of bioinformatics and is used for identifying cancer driver pathways; in particular, it is a pan-cancer common driver pathway identification method based on GAN sample balancing.
Background
Cancer is a complex disease that threatens human health, and its etiology involves a variety of genetic and environmental factors. Understanding the mechanism of carcinogenesis at the molecular level is a major challenge, and meeting it would facilitate cancer diagnosis, treatment and drug design. With the rapid development of next-generation sequencing (NGS) technologies, researchers can better characterize cancer at the molecular level. Several large cancer genome projects, including The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC) and the Cancer Cell Line Encyclopedia (CCLE), have generated and analyzed vast amounts of data, providing unprecedented opportunities for further understanding the molecular and oncogenic mechanisms of cancer. Previous studies have shown that only functional driver mutations promote cancer progression, while passenger mutations have little impact on it. Distinguishing driver mutations from passenger mutations has therefore become an important task in studying cancer pathogenesis.
Early studies largely focused on designing methods that effectively identify individual driver genes with significantly higher mutation rates. However, mutations are highly heterogeneous: the same cancer can arise from mutations in different driver genes. Identifying single driver genes is therefore not sufficient for understanding the mechanism of cancer progression. Further studies have shown that cancer is often caused by the disruption of certain pathways (cell signaling or regulatory pathways), which may be perturbed by different combinations of driver mutations. Identifying driver pathways is thus key to understanding carcinogenic mechanisms at the pathway level. Currently, the driver pathway identification problem can be divided into three directions: identifying a single driver pathway, identifying cooperative driver pathways, and identifying pan-cancer common and specific driver pathways. The present work addresses the identification of pan-cancer common driver pathways.
Identifying common driver pathways at the pan-cancer scale means investigating the commonalities that may exist between different cancer types, which helps deepen the understanding of cancer pathogenesis. The TCGA Pan-Cancer project has collected multi-platform mutation data from thousands of cancer patients across 12 cancer types, providing opportunities for further investigation of such problems. Recently, a class of prior-knowledge-based methods has been proposed, which generally use gene-gene interaction (GGI) networks, protein-protein interaction (PPI) networks, and pathway-pathway interaction (PaPaI) networks. For example, Leiserson et al. proposed the HotNet method based on a directed heat diffusion model, which tries to obtain pathways and protein complexes by incorporating protein-protein interaction network analysis; Kim et al. studied different types of mutual exclusivity across cancer types and proposed the MEMCover method, which identifies sub-networks/pathways based on the HumanNet network; Hajkarim et al. proposed the DAMOKLE algorithm, based on a large gene-gene interaction network, which attempts to identify sub-networks with significant differences in sample mutation frequency between two cancers. Although these methods can achieve good identification performance, relying on prior knowledge may, on the one hand, miss better combinations of mutated genes and, on the other hand, limit the scope of the pathway search, because existing prior knowledge is incomplete and contains only part of the pathway information.
Another category is de novo identification methods. Zhang et al. proposed the ComMDP method, which exploits two characteristics of driver pathways, high mutual exclusivity and high coverage, and directly extends the maximum weight submatrix problem model for a single cancer to multiple cancer types, attempting to identify common driver pathways by accumulating absolute weight values. Wu et al. introduced the CDP-V model, which uses relative proportions instead of absolute numbers and uses the variance to minimize the dispersion of the proportions, and also proposed the CDP-H model, which uses the harmonic mean to minimize the dispersion of the proportions, reducing the number of parameters; both attempt to identify common driver pathways through relative weight values.
Among the above methods, ComMDP uses absolute weight values and obtains the gene set with the maximum weight by accumulating the absolute weight values of each cancer type. It does not consider the imbalance of sample sizes among cancer types, so when sample sizes differ greatly the identification result is biased toward cancer types with larger sample sizes, and some driver pathways with higher commonality may be missed. The PGA-V method mitigates the sample size imbalance by computing relative weights, but it introduces a manually set parameter whose determination requires a large number of tedious experiments, which hinders extension to practical applications. Moreover, the sample size imbalance itself is not actually resolved.
With the rapid development of deep learning, the problem of data imbalance has become increasingly pressing. In 2012, a data augmentation strategy was proposed that generates additional data items fitting the real data distribution by transforming existing real data. However, no better data generation method was available at the time, until 2014, when Goodfellow et al. proposed a powerful generative model based on game theory: generative adversarial networks (GANs). Although many deep-learning-based generative models had been proposed earlier, GANs are among the most successful generative models and have been successfully applied to data augmentation in various fields.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a pan-cancer common driver pathway identification method based on GAN sample balancing, which is a useful tool for identifying pan-cancer common driver pathways and is highly extensible and practical.
The technical solution for achieving the aim of the invention is as follows:
A pan-cancer common driver pathway identification method based on GAN sample balancing comprises the following steps:
1) Generating somatic mutation data that correspond to each cancer and conform to the real data distribution:
1.1) Building the generative adversarial network framework:
Assume an example training set with m_r samples and n_r genes. The generator network of SNV-GANs is defined as G(z), the input of the generator is z ~ N(0, 1), and the generator is defined as follows:
1.1.1) The input layer uses GFC1 to map the noise vector z into a tensor zn of dimension (1, 128);
1.1.2) The hidden layers pass the tensor zn of step 1.1.1) through GFC2, pass the result through GFC3, and likewise through GFC4, finally mapping it into a tensor zn' of dimension (1, 1024);
1.1.3) The output layer uses GFC5 to map the tensor zn' into a tensor gn of dimension (1, m_r*n_r) and reshapes gn into a tensor of dimension (m_r, n_r), which is the output of the generator;
Both the input layer and the hidden layers use a dropout function to deactivate a portion of the neurons and adopt the activation function ReLU defined by formula (1); the output layer uses the activation function Sigmoid defined by formula (2). The discriminator network is defined as D(x); its input is either real data x ~ P_real or generated data, where x denotes a set of somatic mutation data samples. The discriminator is defined as follows:
1.1.4) The input layer uses DFC1 to map x into a tensor xn of dimension (m_r, 256);
1.1.5) The hidden layers pass the tensor xn of step 1.1.4) through DFC2, pass the result through DFC3, and likewise through DFC4, finally mapping it into a tensor xn' of dimension (m_r, 16);
1.1.6) The output layer uses DFC5 to map the tensor xn' into a tensor dn of dimension (m_r, 1), which is the output of the discriminator; both the input layer and the hidden layers use the activation function ReLU defined by formula (1), and the output layer uses the activation function Sigmoid defined by formula (2);
ReLU(x) = max(0, x) (1),
Sigmoid(x) = 1 / (1 + e^(-x)) (2);
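The layer stacks in steps 1.1.1) to 1.1.6) can be traced with a small forward-pass sketch. This is only an illustration under stated assumptions, not the patented implementation: the widths of GFC2/GFC3 and DFC2/DFC3 are not given in the text and are chosen arbitrarily here, the weights are random rather than trained, dropout is omitted, toy sizes replace the real m_r and n_r, and all function names are hypothetical.

```python
import math
import random

def relu(v):
    # Formula (1): ReLU(x) = max(0, x), applied element-wise.
    return [max(0.0, x) for x in v]

def sigmoid(v):
    # Formula (2), written in a numerically stable form.
    out = []
    for x in v:
        if x >= 0:
            out.append(1.0 / (1.0 + math.exp(-x)))
        else:
            e = math.exp(x)
            out.append(e / (1.0 + e))
    return out

def dense(n_in, n_out, rng):
    # One fully connected layer with random weights (assumption: He-style scale).
    scale = math.sqrt(2.0 / n_in)
    w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def apply_layer(layer, v):
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, v)) + bi for row, bi in zip(w, b)]

def build(widths, rng):
    return [dense(a, b, rng) for a, b in zip(widths, widths[1:])]

def forward(layers, v):
    # ReLU after every layer except the last, which uses Sigmoid.
    for layer in layers[:-1]:
        v = relu(apply_layer(layer, v))
    return sigmoid(apply_layer(layers[-1], v))

rng = random.Random(0)
m_r, n_r = 4, 6  # toy sizes for illustration only

# GFC1..GFC5: 100 -> 128 -> 256 -> 512 -> 1024 -> m_r*n_r (256 and 512 are assumed)
generator = build([100, 128, 256, 512, 1024, m_r * n_r], rng)
# DFC1..DFC5: n_r -> 256 -> 64 -> 32 -> 16 -> 1 (64 and 32 are assumed)
discriminator = build([n_r, 256, 64, 32, 16, 1], rng)

z = [rng.gauss(0.0, 1.0) for _ in range(100)]               # noise z ~ N(0, 1)
flat = forward(generator, z)                                # dimension (1, m_r*n_r)
fake = [flat[i * n_r:(i + 1) * n_r] for i in range(m_r)]    # reshaped to (m_r, n_r)
scores = [forward(discriminator, row)[0] for row in fake]   # dimension (m_r, 1)
```

The Sigmoid output layers keep every generated entry and every discriminator score between 0 and 1, which is what allows the later thresholding of generated data into a binary mutation matrix.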
1.2) Training process of SNV-GANs:
1.2.1) Given a somatic mutation matrix A_r (m_r × n_r) and a proportion parameter ∝ (∝ < 1) for randomly extracting samples, randomly extract from matrix A_r a submatrix M_r with m samples, m = m_r × ∝; extract 64 such submatrices M_r in total to construct the training set X, and input the training set X into the generative adversarial network for training;
1.2.2) Initialize the parameters θ_d of the discriminator D() and the parameters θ_g of the generator G();
1.2.3) Set the current round epoch = 1 and randomly generate a 1 × 100 Gaussian-distributed noise vector z;
1.2.4) Use z from step 1.2.3) as the input of the generator to obtain generated data of size m × n_r;
1.2.5) Calculate the generator loss value according to formula (3), and then update the parameters θ_g of the generator:
where G(z(i)) denotes the data generated by the generator from the noise vector z(i), and D(G(z(i))) denotes the probability that the discriminator judges the generated data to be real data; the smaller loss_G is, the better;
1.2.6) Randomly extract a sample set x from the training set X;
1.2.7) Randomly generate a 1 × 100 Gaussian-distributed noise vector z;
1.2.8) Use z from step 1.2.7) as the input of the generator to obtain generated data of size m × n_r;
1.2.9) Calculate the discriminator loss value according to formula (4), and then update the parameters θ_d of the discriminator:
where D(x(i)) denotes the probability that the discriminator judges the real data to be real data, and 1 - D(G(z(i))) denotes the probability that D judges the generated data to be generated data; the larger loss_D is, the better;
1.2.10) Check whether the current round epoch has reached the set maximum number of rounds; if so, stop training; otherwise return to step 1.2.3). The trained generator G() is finally obtained;
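Formulas (3) and (4) are referenced in steps 1.2.5) and 1.2.9) but are not reproduced in this text. One plausible form, assuming SNV-GANs follows the standard GAN objective of Goodfellow et al. (an assumption, not confirmed by the source), is consistent with the statements that a smaller generator loss and a larger discriminator loss are better:

```latex
\mathrm{loss}_G = \frac{1}{m}\sum_{i=1}^{m} \log\!\Bigl(1 - D\bigl(G(z^{(i)})\bigr)\Bigr) \qquad (3)

\mathrm{loss}_D = \frac{1}{m}\sum_{i=1}^{m} \Bigl[\log D\bigl(x^{(i)}\bigr) + \log\!\Bigl(1 - D\bigl(G(z^{(i)})\bigr)\Bigr)\Bigr] \qquad (4)
```

Here the sums run over the m samples of the current batch. Under this reading the generator lowers loss_G by pushing D(G(z)) toward 1, while the discriminator raises loss_D by pushing D(x) toward 1 and D(G(z)) toward 0.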
1.3) Data processing:
1.3.1) Randomly generate a 1 × 100 Gaussian-distributed noise vector z;
1.3.2) Input the vector z from step 1.3.1) into the trained generator G() to obtain the generated data G_data = G(z);
1.3.3) Set the values of G_data that are greater than or equal to 0.85 to 1 and the values that are less than 0.85 to 0, obtaining a new binary matrix A_fakedata;
1.3.4) Take the largest cancer sample size among the somatic mutation matrices of the R cancers as m_max, 0 < max < R, and insert matrix A_r into an amplification matrix; at this point the amplification matrix contains the m_r samples of A_r;
1.3.5) If the number of samples still to be added is greater than the sample number m_r of matrix A_r, go to step 1.3.6); if it is smaller than m_r, go to step 1.3.8);
1.3.6) Randomly extract m_r samples from A_fakedata of step 1.3.3) to form an extracted matrix, then calculate the mutation rate of each gene in the extracted matrix and in matrix A_r, obtaining the two corresponding mutation probability sets V and Q;
1.3.7) Input the two sets V and Q from step 1.3.6) into the JS divergence formula defined by formula (5) to obtain a distribution value; the smaller the distribution value, the more similar the mutation rates of the extracted matrix are to those of matrix A_r. If the distribution value is less than 0.09, insert the extracted matrix into the amplification matrix and update the sample count of the amplification matrix; otherwise repeat step 1.3.6);
1.3.8) Randomly extract one sample from A_fakedata of step 1.3.3), add it directly to the amplification matrix, and update the sample count of the amplification matrix;
1.3.9) If the number of samples in the current amplification matrix equals the maximum sample size m_max, sample supplementation is finished; otherwise go to step 1.3.5). An amplification matrix whose sample size equals m_max is finally obtained, and it replaces the original matrix in the following steps;
2) The CDP-HA model, which minimizes the dispersion among the total weights of the individual cancers:
Suppose there are R (R ≥ 2) cancer types. For each cancer type, a binary somatic mutation matrix A_r records whether each gene in each sample is mutated; it has m_r rows and n_r columns, where rows represent samples (patients) and columns represent genes, r = 1, 2, 3, ..., R. a_i- denotes the i-th sample in matrix A_r and a_-j denotes the j-th gene in matrix A_r; when the j-th gene of the i-th sample is mutated in the mutation matrix of the r-th cancer, the corresponding entry is 1, and otherwise it is 0. Given a gene set S of size k, consider the corresponding submatrix of size m_r × k in matrix A_r. For each gene in S, the samples of this submatrix in which that gene is mutated form its covered sample set. The total number of samples covered by S in the submatrix measures the coverage of S, and the total number of overlapping covered samples measures the mutual exclusivity of S;
Based on the symbols and problem definition above, a nonlinear maximum weight function model, CDP-HA, is constructed: given the binary somatic mutation matrices A_r (m_r rows, n_r columns) of the R cancer types and a parameter K, let W_C(S) be the weight function to be maximized, and determine an m × K submatrix that maximizes it. The specific formula (6) is as follows:
where the per-cancer term represents the absolute weight value of gene set S in the r-th cancer type;
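Formula (6) is not reproduced in this text, so the following is only a hedged sketch of how a weight of this kind could be computed. It assumes that the absolute weight of a gene set in one cancer is the familiar maximum-weight-submatrix weight (coverage minus overlap, i.e. twice the number of covered samples minus the sum of per-gene mutated sample counts) and that W_C combines the per-cancer weights by their harmonic mean, as the later summary of CDP-HA suggests. The function names and the handling of non-positive weights are hypothetical.

```python
def coverage(matrix, genes):
    # Number of samples (rows) with at least one mutation among the given gene columns.
    return sum(1 for row in matrix if any(row[g] for g in genes))

def absolute_weight(matrix, genes):
    # Assumed absolute weight: coverage minus overlap,
    # which equals 2 * |covered samples| - sum over genes of |samples mutated in that gene|.
    per_gene_total = sum(row[g] for row in matrix for g in genes)
    return 2 * coverage(matrix, genes) - per_gene_total

def harmonic_mean_weight(matrices, genes):
    # Assumed reading of W_C(S): harmonic mean of the per-cancer absolute weights.
    weights = [absolute_weight(a, genes) for a in matrices]
    if any(w <= 0 for w in weights):
        return 0.0  # assumption: a non-positive weight in any cancer disqualifies the set
    return len(weights) / sum(1.0 / w for w in weights)

# Toy binary mutation matrices (rows = samples, columns = genes, 0-based gene indices).
a1 = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
a2 = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
w_c = harmonic_mean_weight([a1, a2], [0, 1])
```

Because the harmonic mean is dominated by its smallest term, a gene set scores well only when its weight is high in every cancer type, which matches the stated aim of keeping the per-cancer weights close to one another.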
3) Introducing a parthenogenetic algorithm to solve the CDP-HA model:
3.1) Setting the fitness function:
Given a chromosome E, let M_E denote the submatrix corresponding to the chromosome, of size m × K. The fitness function Fitness(E) is defined by formula (7); the larger the fitness value, the better the feasible solution;
Fitness(E) = W_C(M_E) (7);
3.2) Setting the selection operator:
Roulette selection and an elite strategy are used to generate the new generation of the population: the individual with the highest fitness is inherited directly from the parent generation to the offspring generation, and the roulette selection operator then generates the remaining N-1 individuals;
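The selection operator can be sketched as follows. This is a minimal illustration, assuming fitness values are non-negative so that they can serve directly as roulette weights; the function name `next_generation` and the uniform fallback when all fitness values are zero are hypothetical choices, not taken from the source.

```python
import random

def next_generation(population, fitness, rng):
    scores = [fitness(ind) for ind in population]
    # Elite strategy: the fittest individual is copied unchanged.
    elite = population[scores.index(max(scores))]
    # Roulette wheel: selection probability proportional to fitness.
    weights = [max(s, 0.0) for s in scores]
    if sum(weights) == 0:
        weights = [1.0] * len(population)  # assumption: fall back to uniform selection
    rest = rng.choices(population, weights=weights, k=len(population) - 1)
    return [elite] + rest

rng = random.Random(1)
pop = [(1, 2, 3), (4, 5, 6), (2, 2, 2), (0, 1, 0)]
new_pop = next_generation(pop, sum, rng)
```

The new population keeps the size N of the old one, and its first entry is always the current best individual, so the best fitness never decreases from one generation to the next.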
3.3) Setting the recombination operator:
A recombination operator based on a greedy strategy is adopted, with the following steps. First, given a parent chromosome E = {e_1, e_2, ..., e_k} (e_i = 1, 2, ..., n), where e_i denotes a gene number, so that E is also called a gene set, the candidate gene set is determined. Then one gene is randomly deleted from gene set E to obtain gene set E'. Finally, the optimal gene is selected from the candidate set by the greedy strategy and added to E' to generate the final new offspring;
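A sketch of the greedy recombination step is given below. The exact definition of the candidate gene set is not recoverable from the text; this illustration assumes it consists of all genes not present in the parent chromosome, and the function name `greedy_recombine` is hypothetical.

```python
import random

def greedy_recombine(parent, n_genes, fitness, rng):
    # Assumed candidate set: every gene number 1..n that is not in the parent.
    candidates = [g for g in range(1, n_genes + 1) if g not in parent]
    # Randomly delete one gene from the parent to obtain E'.
    removed = rng.choice(sorted(parent))
    reduced = set(parent) - {removed}
    # Greedy step: add the candidate that gives the highest fitness.
    best_gene = max(candidates, key=lambda g: fitness(reduced | {g}))
    return reduced | {best_gene}

rng = random.Random(0)
parent = {1, 2, 3}
# Toy fitness (sum of gene numbers) used only to make the example checkable.
child = greedy_recombine(parent, 6, sum, rng)
```

With this toy fitness the best candidate is always gene 6, so the offspring keeps two of the parent's genes and gains gene 6. With the real fitness of formula (7), each recombination costs one fitness evaluation per candidate gene.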
3.4) Setting the parameters:
Input the augmented somatic mutation matrices A_r of the R cancers, the gene number g_number, the parameter k, the population size N, the number of algorithm executions t, and the maximum number of generations maxg;
3.5) Constructing the initial population:
Chromosomes are encoded in decimal. One chromosome represents one individual and one solution vector of the problem. In the parthenogenetic algorithm, a set of K genes is taken as a solution, i.e. E = {e_1, e_2, ..., e_k} (e_i = 1, 2, ..., n). Individuals in the population are initialized as follows: a random ordering of the natural numbers 1 to n is generated, each number representing one gene in the mutation matrix, and the n genes are grouped in sequence, giving n/K gene sets S_1, S_2, ..., S_n/k. Let S_max be the gene set selected from among them; the genes of S_max form an initial chromosome, and N initial chromosomes are generated to form the initial population; the first K numbers are selected as an initial chromosome, and the initial population pop_0 is generated. The fitness values of the chromosomes in pop_0 are then calculated and compared, the best individual is stored in the variable best, and the initial iteration count is set to step = 0;
3.6) Performing the iteration:
3.6.1) If step > maxg, go to step 3.6.5) and obtain the common driver pathway of size K; otherwise go to step 3.6.2);
3.6.2) For the population pop_step, first put the chromosome best with the highest fitness value in pop_step into pop_step+1, then execute the roulette selection operator to select the remaining N-1 chromosomes and put them into pop_step+1;
3.6.3) For each chromosome E and the offspring E' produced from it by the recombination operator, if step < 0.7 × maxg or Fitness(E') > Fitness(E), update the chromosome E = E'; otherwise do not update and retain E. Then set step = step + 1;
3.6.4) Take the chromosome with the highest fitness value in pop_step+1; if its fitness value is larger than that of the chromosome best, update best to this chromosome, i.e. best = the best chromosome of pop_step+1;
3.6.5) Convert the chromosome best into a gene set to obtain the submatrix M and output it; the output M is the common driver pathway S of size K.
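The acceptance rule in step 3.6.3) accepts every offspring during the first 70% of the generations and only improving offspring afterwards. The sketch below isolates that rule on a single chromosome; it is a simplification that leaves out the population and the selection step, and the names `evolve` and `swap_one_gene` as well as the toy fitness are hypothetical.

```python
import random

def swap_one_gene(chromosome, rng, n_genes=20):
    # Toy offspring operator: replace one random gene by a random outside gene.
    outside = [g for g in range(1, n_genes + 1) if g not in chromosome]
    dropped = rng.choice(sorted(chromosome))
    return (chromosome - {dropped}) | {rng.choice(outside)}

def evolve(chromosome, fitness, make_offspring, maxg, rng):
    best = chromosome
    step = 0
    while step <= maxg:  # the loop ends once step > maxg, as in step 3.6.1)
        child = make_offspring(chromosome, rng)
        # Step 3.6.3): explore freely early on, accept only improvements later.
        if step < 0.7 * maxg or fitness(child) > fitness(chromosome):
            chromosome = child
        # Step 3.6.4): keep track of the best chromosome seen so far.
        if fitness(chromosome) > fitness(best):
            best = chromosome
        step += 1
    return best

rng = random.Random(0)
start = frozenset({1, 2, 3})
best = evolve(start, sum, swap_one_gene, 50, rng)
```

Accepting worse offspring early keeps the search from settling too quickly, while the strict rule in the last 30% of generations turns the run into a hill climb around the solutions found so far.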
In this technical solution, a generative adversarial network (GAN) is used to generate samples of somatic mutation data for cancers with few samples, so that the sample sizes of the cancer types are balanced; a mathematical model, CDP-HA, which combines the absolute weight values of the cancer types using the harmonic mean, is then constructed; and finally a parthenogenetic algorithm is used to solve the model. For multiple cancer types, the proposed method can effectively supplement cancers with few samples and resolves the difference in sample numbers. Moreover, the gene sets identified with the proposed model are mutated in most samples of these cancers, and the proportions of mutated samples are very close across the individual cancers. In addition, the method detects biologically significant gene sets that are missed by other methods. The GAN can therefore reduce the difference in sample size between cancer types, offering a new approach to identifying pan-cancer common driver pathways.
Compared with the prior art, the technical solution has the following advantages:
(1) A nonlinear weight function that maximizes weight while minimizing dispersion is designed to measure the relative weights of multiple cancer types.
(2) SNV-GANs, a data augmentation method suited to cancer somatic mutation data, has high practical value; this is also the first application of GANs to cancer somatic mutation data.
(3) The pan-cancer common driver pathways found by the overall solution contain more genes enriched in the same important signaling pathway, and the identified genes are enriched in more important signaling pathways.
The method serves as a useful tool both for augmenting cancer somatic mutation data and for identifying pan-cancer common driver pathways. It can provide more biological information, is highly practical and extensible, identifies more genes enriched in important signaling pathways, and identifies genes enriched in a larger number of important signaling pathways.
Drawings
FIG. 1 is an example diagram of a common driver pathway in the embodiment;
FIG. 2 is an example diagram of the parthenogenetic algorithm in the embodiment;
FIG. 3 is the network model of SNV-GANs in the embodiment;
FIG. 4 is a schematic diagram of the SNV-GANs training process in the embodiment;
FIG. 5 is the pseudocode of the SNV-GANs training process in the embodiment;
FIG. 6 is an example diagram of the actual effect of the generated data in the embodiment.
Detailed Description
The invention will now be described in further detail with reference to the drawings and specific examples, which are not intended to limit the invention thereto.
Examples:
Step 1) of the experiment was run on a Linux server (Intel(R) Xeon(R) Gold 6230 @ 2.10 GHz CPU, 256 GB memory, 32 GB video memory) in a Python 3.7.9 environment. Steps 2) and 3) were run on a computer (Intel(R) Core(TM) i5-6500 @ 3.20 GHz CPU, 32 GB memory) with the Windows 10 operating system, Eclipse 4.23 as the development tool, and Java 1.8.0.
This example addresses the pan-cancer common driver pathway identification problem.
A pan-cancer common driver pathway identification method based on GAN sample balancing comprises the following steps:
Somatic mutation data A_r are assumed for three cancer types: A_1 (COADCORE, 95 samples, 211 genes), A_2 (BLCA, 95 samples, 211 genes) and A_3 (LUAD, 95 samples, 211 genes).
1) Generating somatic mutation data that correspond to each cancer and conform to the real data distribution:
1.1) Building the generative adversarial network framework:
Taking the BLCA somatic mutation data A_2 as the example training set, the generator network of SNV-GANs is defined as G(z), the input of the generator is z ~ N(0, 1), and the generator is defined as follows, as shown in FIG. 3:
1.1.1) The input layer uses GFC1 to map the noise vector z into a tensor zn of dimension (1, 128);
1.1.2) The hidden layers pass the tensor zn of step 1.1.1) through GFC2, pass the result through GFC3, and likewise through GFC4, finally mapping it into a tensor zn' of dimension (1, 1024);
1.1.3) The output layer uses GFC5 to map the tensor zn' into a tensor gn of dimension (1, 95 × 211) and reshapes gn into a tensor of dimension (95, 211), which is the output of the generator;
Both the input layer and the hidden layers use a dropout function to deactivate a portion of the neurons and adopt the activation function ReLU defined by formula (1); the output layer uses the activation function Sigmoid defined by formula (2). The discriminator network is defined as D(x); its input is either real data x ~ P_real or generated data, where x denotes a set of somatic mutation data samples. The discriminator is defined as follows, as shown in FIG. 3:
1.1.4) The input layer uses DFC1 to map x into a tensor xn of dimension (95, 256);
1.1.5) The hidden layers pass the tensor xn of step 1.1.4) through DFC2, pass the result through DFC3, and likewise through DFC4, finally mapping it into a tensor xn' of dimension (95, 16);
1.1.6) The output layer uses DFC5 to map the tensor xn' into a tensor dn of dimension (95, 1), which is the output of the discriminator; both the input layer and the hidden layers use the activation function ReLU defined by formula (1), and the output layer uses the activation function Sigmoid defined by formula (2);
ReLU(x) = max(0, x) (1),
Sigmoid(x) = 1 / (1 + e^(-x)) (2);
1.2 Training process of SNV-GANs:
1.2.1 Given a somatic mutation matrix A_2 (95×211) and a scaling parameter α=0.7 for random sample extraction, a submatrix M_2 with m=95×0.7≈67 samples is randomly extracted from the matrix A_2 according to the scaling parameter α; 64 such submatrices M_2 are extracted in total to construct the training set X, which is input into the generative adversarial network, as shown in FIG. 4;
1.2.2 Initializing the parameters θ_d of the discriminator D() and the parameters θ_g of the generator G();
1.2.3 Letting the current round epoch=1 and randomly generating a 1×100 Gaussian-distributed noise vector z;
1.2.4 Using z from step 1.2.3) as the input of the generator to obtain generated data of size 67×211;
1.2.5 Calculating the generator loss value according to equation (3), and then updating the parameters θ_g of the generator:
Where G(z(i)) represents the data generated by the generator from the noise vector z(i), and D(G(z(i))) represents the probability that the discriminator judges the generated data to be real data; the smaller the loss G, the better;
1.2.6 Randomly extracting a sample set x from the training set X;
1.2.7 Randomly generating a 1×100 Gaussian-distributed noise vector z;
1.2.8 Using z from step 1.2.7) as the input of the generator to obtain generated data of size 95×211;
1.2.9 Calculating the discriminator loss value according to equation (4), and then updating the parameters θ_d of the discriminator:
Where D(x(i)) represents the probability that the discriminator judges the real data to be real data, and 1-D(G(z(i))) represents the probability that D judges the generated data to be generated data; the larger the loss D, the better;
1.2.10 Judging whether the current epoch has reached 10000; if so, training stops; otherwise, the process returns to step 1.2.3). A trained generator G() is finally obtained;
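The training-set construction of step 1.2.1 and the two loss values can be sketched as follows. Equations (3) and (4) are not reproduced in this text, so the sketch assumes the standard GAN objective, which matches the descriptions given: loss G = (1/m) Σ log(1 - D(G(z(i)))), to be minimized, and loss D = (1/m) Σ [log D(x(i)) + log(1 - D(G(z(i))))], to be maximized. The function names, the rounding-up of 95×0.7 to 67, and the small `eps` guard are assumptions.

```python
import math
import numpy as np

def build_training_set(a2, alpha=0.7, n_sub=64, rng=None):
    """Stack n_sub random row-submatrices of a2, each with ceil(alpha * rows) rows."""
    rng = rng if rng is not None else np.random.default_rng()
    m = math.ceil(a2.shape[0] * alpha)   # 95 * 0.7 -> 67 (rounding up is an assumption)
    subs = [a2[rng.choice(a2.shape[0], size=m, replace=False)]
            for _ in range(n_sub)]
    return np.stack(subs)

def generator_loss(d_fake, eps=1e-12):
    """Assumed form of equation (3): mean of log(1 - D(G(z))). Smaller is better."""
    return float(np.mean(np.log(1.0 - d_fake + eps)))

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Assumed form of equation (4): mean log D(x) + mean log(1 - D(G(z))). Larger is better."""
    return float(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

rng = np.random.default_rng(0)
a2 = rng.integers(0, 2, size=(95, 211))   # toy stand-in for the matrix A_2
X = build_training_set(a2, rng=rng)
print(X.shape)
```

A discriminator that outputs 0.5 everywhere gives a generator loss of log 0.5 and a discriminator loss of 2 log 0.5, which is the usual reference point for an undecided discriminator.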
1.3 Data processing:
1.3.1 Randomly generating a 1×100 Gaussian-distributed noise vector z;
1.3.2 Inputting the vector z from step 1.3.1) into the trained generator G() to obtain the generated data G_data=G(z);
1.3.3 Setting the values of G_data that are greater than or equal to 0.85 to 1 and the values that are less than 0.85 to 0 to obtain a new binary matrix A_fakedata;
1.3.4 Taking the largest sample size among the 3 cancer somatic mutation matrices, namely m_max=m_1=489, and inserting the matrix A_2 into an amplification matrix; at this point the amplification matrix contains the m_2=95 samples of A_2;
1.3.5 If the number of samples still to be added is greater than the sample number m_2=95 of the matrix A_2, performing step 1.3.6); if it is smaller than the sample number m_2=95 of the matrix A_2, performing step 1.3.8);
1.3.6 Randomly extracting m_2 samples from A_fakedata of step 1.3.3) to form an extracted matrix, and separately calculating the mutation rate of each gene in the extracted matrix and in the matrix A_2 to obtain two corresponding mutation probability sets V and Q;
1.3.7 Inputting the two sets V and Q obtained in step 1.3.6) into the JS divergence formula defined by formula (5) to obtain a divergence value; the smaller the value, the more similar the mutation rates of the extracted matrix are to those of the matrix A_2. If the divergence value is less than 0.09, the extracted matrix is inserted into the amplification matrix and the sample count of the amplification matrix is updated; otherwise, step 1.3.6) is repeated;
1.3.8 Randomly extracting one sample from A_fakedata of step 1.3.3), adding it directly to the amplification matrix, and updating the sample count of the amplification matrix;
1.3.9 If the sample count of the current amplification matrix equals the maximum sample size m_max=489, sample supplementation is finished; otherwise, step 1.3.5) is performed again. The result is an amplification matrix whose sample count equals the maximum sample size m_max=489 and whose mutation rates are substantially identical to those of the original matrix A_2; this amplification matrix is then used as the matrix A_2. The training process pseudocode is shown in FIG. 5;
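The data-processing steps 1.3.3 to 1.3.9 can be sketched as below. Formula (5) is not reproduced in this text, so the sketch assumes the usual Jensen-Shannon divergence, JS(V‖Q) = ½ KL(V‖M) + ½ KL(Q‖M) with M = ½ (V + Q), applied to the per-gene mutation rates after normalizing each to sum to 1 (the normalization is an assumption). The function names, the `max_tries` guard, the treatment of the "equal" case as single-sample insertion, and the pooled `a_fake` matrix with at least m_2 rows are likewise assumptions.

```python
import numpy as np

def binarize(g_data, threshold=0.85):
    # Step 1.3.3: values >= 0.85 become 1, all others 0
    return (g_data >= threshold).astype(int)

def mutation_rates(matrix):
    # Fraction of samples in which each gene (column) is mutated
    return matrix.mean(axis=0)

def js_divergence(v, q, eps=1e-12):
    # Assumed form of formula (5); inputs are normalized to distributions
    p = (v + eps) / np.sum(v + eps)
    r = (q + eps) / np.sum(q + eps)
    mid = 0.5 * (p + r)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, mid) + 0.5 * kl(r, mid)

def augment(a2, a_fake, m_max, js_max=0.09, max_tries=1000, rng=None):
    """Grow a copy of a2 to m_max rows using generated samples from a_fake."""
    rng = rng if rng is not None else np.random.default_rng()
    m2 = a2.shape[0]
    amplified = a2.copy()                       # step 1.3.4
    q = mutation_rates(a2)
    while amplified.shape[0] < m_max:           # step 1.3.9
        missing = m_max - amplified.shape[0]
        if missing > m2:                        # steps 1.3.6 and 1.3.7
            for _ in range(max_tries):
                rows = rng.choice(a_fake.shape[0], size=m2, replace=False)
                block = a_fake[rows]
                if js_divergence(mutation_rates(block), q) < js_max:
                    amplified = np.vstack([amplified, block])
                    break
            else:
                raise RuntimeError("no sufficiently similar block found")
        else:                                   # step 1.3.8 (also used when missing == m2)
            row = a_fake[rng.integers(a_fake.shape[0])]
            amplified = np.vstack([amplified, row[None, :]])
    return amplified

rng = np.random.default_rng(0)
a2 = binarize(rng.random((95, 211)))        # toy stand-in for A_2
a_fake = binarize(rng.random((600, 211)))   # toy stand-in for A_fakedata
amplified = augment(a2, a_fake, m_max=489, rng=rng)
print(amplified.shape)
```

With these toy inputs the loop adds four blocks of 95 samples and then 14 single samples, which is how 95 grows to 489 under the rule of step 1.3.5.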
2) Constructing the model CDP-HA, which minimizes the dispersion among the total weights of the individual cancers:
There are R=3 cancer types. For each cancer type, a binary somatic mutation matrix A_r records whether each gene in each sample is mutated; it has m_r rows and n_r columns, where the rows represent samples (patients) and the columns represent genes, r=1,2,3,...,R. a_i- represents the i-th sample in matrix A_r and a_-j represents the j-th gene in matrix A_r. When the j-th gene of the i-th sample is mutated in the mutation matrix of the r-th cancer, the corresponding entry of A_r is 1; otherwise it is 0. Given a gene set S of size k, consider the corresponding submatrix of size m_r×k in matrix A_r. For each gene in S, the samples of the submatrix in which that gene is mutated are collected; the total number of samples covered by the submatrix measures the coverage of the gene set S, and the total overlap among the covered samples (the coverage overlap) measures the mutual exclusivity of the gene set S;
Based on these definitions, a nonlinear maximum-weight function model CDP-HA is constructed: given the binary somatic mutation matrices A_r (m_r rows and n_r columns) of the R cancer types and a parameter K (0<K≤10), an m×K submatrix is determined that maximizes the weight-sum function W_C(S). The specific formula (6) is as follows:
Where the per-cancer term in formula (6) represents the absolute weight of the gene set S in the r-th cancer type. FIG. 1 shows the mutation matrices of three cancers with two submatrices S_1 and S_2 of scale K=3. Computing their weights with the proposed model shows that S_2 has the higher weight, so S_2 better matches the commonality required of a common driver pathway;
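The coverage and coverage-overlap quantities defined above can be illustrated on a small matrix. This sketch does not reproduce W_C(S) or formula (6), which are not available in this text; it only computes the two building blocks, and it assumes the commonly used definition of overlap as the sum of per-gene mutated-sample counts minus the number of covered samples. The function names and the 0-based gene indices are assumptions.

```python
import numpy as np

def coverage(a, genes):
    """Number of samples (rows) with a mutation in at least one gene of the set."""
    sub = a[:, genes]
    return int(np.count_nonzero(sub.any(axis=1)))

def coverage_overlap(a, genes):
    """Assumed definition: sum of per-gene counts minus the number of covered samples."""
    sub = a[:, genes]
    return int(sub.sum()) - coverage(a, genes)

# Toy mutation matrix: 5 samples (rows) by 4 genes (columns)
a = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 0, 1]])
genes = [0, 1, 2]
print(coverage(a, genes))          # 4
print(coverage_overlap(a, genes))  # 1
```

Samples 0 to 3 are covered, and sample 3 carries two mutations from the set, so the coverage is 4 and the overlap is 1. A gene set with high coverage and low overlap is both widely mutated and close to mutually exclusive.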
3) Introducing a single-parent genetic algorithm to solve the model CDP-HA:
3.1 Setting a fitness function:
Given a chromosome E, let M_E denote the submatrix corresponding to that chromosome; the size of the matrix M_E is m×K. The fitness function Fitness(E) is defined by formula (7) below; the larger the fitness value, the better the feasible solution;
Fitness(E) = W_C(M_E) (7);
3.2 Setting a selection operator:
Roulette selection combined with an elite strategy is used to generate the new generation of the population: the individual with the highest fitness is inherited directly from the parent generation by the offspring generation, and the roulette selection operator is then used to generate the remaining N-1 individuals;
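The selection operator can be sketched as follows. The text does not say how selection probabilities are derived from fitness values, so this sketch assumes they are proportional to fitness shifted by the population minimum, with a uniform fallback when all fitness values are equal; the function name `select_next_generation` is also an assumption.

```python
import numpy as np

def select_next_generation(population, fitness, rng=None):
    """Keep the fittest individual, then draw N-1 more by roulette selection."""
    rng = rng if rng is not None else np.random.default_rng()
    scores = np.array([fitness(ind) for ind in population], dtype=float)
    elite = population[int(np.argmax(scores))]        # elite strategy
    shifted = scores - scores.min()                   # assumed: make weights non-negative
    total = shifted.sum()
    if total > 0:
        probs = shifted / total
    else:
        probs = np.full(len(population), 1.0 / len(population))
    picks = rng.choice(len(population), size=len(population) - 1, p=probs)
    return [elite] + [population[i] for i in picks]

population = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
next_gen = select_next_generation(population, fitness=sum,
                                  rng=np.random.default_rng(0))
print(next_gen[0])  # (7, 8, 9)
```

The elite individual always occupies the first slot, so the best fitness in the population never decreases from one generation to the next.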
3.3 Setting a recombination operator:
A recombination operator based on a greedy strategy is adopted, with the following steps: first, given a parent chromosome E={e_1,e_2,...,e_k} (e_i=1,2,...,n), where each e_i represents a gene number (so E is also referred to as a gene set), the candidate gene set is determined; then one gene is randomly deleted from the gene set E to obtain the gene set E'; finally, the optimal gene is selected from the candidate set according to the greedy strategy and added to E' to generate the final new offspring;
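A minimal sketch of the greedy recombination operator follows. The definition of the candidate gene set is not legible in this text, so the sketch assumes it consists of all genes 1..n that are not in the parent chromosome; the function name `greedy_recombine` and the toy fitness function are also assumptions.

```python
import random

def greedy_recombine(parent, n_genes, fitness, rng=random):
    """Delete one random gene, then add the candidate gene that maximizes fitness."""
    parent = list(parent)
    # Assumed candidate set: every gene number not already in the parent
    candidates = [g for g in range(1, n_genes + 1) if g not in parent]
    removed = rng.choice(parent)
    reduced = [g for g in parent if g != removed]          # gene set E'
    best_gene = max(candidates, key=lambda g: fitness(reduced + [g]))
    return reduced + [best_gene]

# Toy fitness: the sum of the gene numbers
child = greedy_recombine([1, 2, 3], n_genes=10, fitness=sum,
                         rng=random.Random(0))
print(sorted(child))
```

Because the replacement gene is chosen by exhaustive comparison over the candidate set, each call costs one fitness evaluation per candidate gene.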
3.4 Setting parameters:
Inputting the augmented somatic mutation matrices A_r of the 3 cancers, with gene number g_number=211, parameter K=3, population size N=20, number of algorithm executions step=10, and maximum number of generations maxg=1000;
3.5 Constructing an initial population:
Chromosomes are encoded in decimal; one chromosome represents one individual, i.e. one solution vector of the problem. In the single-parent genetic algorithm, a set of K=3 genes is used as a solution, namely E={e_1,e_2,...,e_k} (e_i=1,2,...,n). The individuals of the population are initialized by randomly generating a set of natural numbers from 1 to 20, in which each number represents one gene in the mutation matrix; the 20 genes are grouped in sequence to obtain n/k=20/3≈7 gene sets S_1,S_2,...,S_n/k. Let S_max be the gene set selected from among them; the genes of S_max form an initial chromosome, and N initial chromosomes are generated in this way to form the initial population pop_0 of size N (the first K=3 numbers are selected as an initial chromosome). The fitness values of the chromosomes in pop_0 are calculated and compared, the best individual is stored in the variable best, and the initial iteration number is set to step=0, as shown in FIG. 2;
3.6 Performing an iterative operation:
3.6.1 If step > maxg, going to step 3.6.5) to obtain the common driver pathway of size K; otherwise going to step 3.6.2);
3.6.2 For the population pop_step, first putting the chromosome with the highest fitness value in pop_step into pop_step+1, then executing the roulette selection operator to select the remaining N-1=20-1=19 chromosomes and put them into pop_step+1;
3.6.3 If step<700 or Fitness(E')>Fitness(E), updating the chromosome E=E' (E' being the offspring of E under the recombination operator of step 3.3); otherwise keeping E unchanged; then setting step=step+1;
3.6.4 Taking the chromosome with the highest fitness value in pop_step+1; if its fitness value is greater than that of the best chromosome, updating best to this chromosome of pop_step+1;
3.6.5 Converting the best chromosome into a gene set to obtain the submatrix M and outputting it; the output M is the common driver pathway S of size K=3.
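The iteration of steps 3.6.1 to 3.6.5 can be sketched end to end on a toy problem. This is an illustration under stated assumptions rather than the patented procedure: the fitness function is a stand-in for W_C, the initial population is drawn uniformly at random instead of following step 3.5, roulette weights are fitness values shifted by the population minimum, the candidate set is every gene not in the parent, and recombination is applied to every chromosome including the elite one. The names `recombine`, `run_search` and `explore_until` are hypothetical.

```python
import random

def recombine(parent, n_genes, fitness, rng):
    # Greedy recombination; assumed candidate set: genes not in the parent
    candidates = [g for g in range(1, n_genes + 1) if g not in parent]
    removed = rng.choice(parent)
    reduced = [g for g in parent if g != removed]
    return reduced + [max(candidates, key=lambda g: fitness(reduced + [g]))]

def run_search(n_genes, k, fitness, pop_size=20, maxg=1000,
               explore_until=700, seed=0):
    rng = random.Random(seed)
    # Assumed initialization: random chromosomes of k distinct genes
    pop = [rng.sample(range(1, n_genes + 1), k) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    step = 0
    while step <= maxg:                                   # 3.6.1
        scores = [fitness(e) for e in pop]
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]        # roulette weights
        elite = pop[scores.index(max(scores))]
        nxt = [list(elite)] + [list(e) for e in           # 3.6.2
                               rng.choices(pop, weights=weights, k=pop_size - 1)]
        for i, e in enumerate(nxt):                       # 3.6.3
            child = recombine(e, n_genes, fitness, rng)
            if step < explore_until or fitness(child) > fitness(e):
                nxt[i] = child
        step += 1
        top = max(nxt, key=fitness)                       # 3.6.4
        if fitness(top) > fitness(best):
            best = top
        pop = nxt
    return sorted(best)                                   # 3.6.5

# Toy run: fitness is the sum of gene numbers, so the optimum is {10, 11, 12}
result = run_search(n_genes=12, k=3, fitness=sum, maxg=50, explore_until=35)
print(result)
```

Accepting every offspring while step is below the threshold lets the population explore freely early on, whereas accepting only improvements afterwards turns the search into a hill climb; the variable `best` preserves the best chromosome seen in either phase.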
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211581374.8A CN115762646B (en) | 2022-12-09 | 2022-12-09 | A pan-cancer common driver pathway identification method based on GAN sample balance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115762646A (en) | 2023-03-07 |
CN115762646B (en) | 2025-02-14 |
Family
ID=85344971
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211581374.8A Active CN115762646B (en) | 2022-12-09 | 2022-12-09 | A pan-cancer common driver pathway identification method based on GAN sample balance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115762646B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106326915A (en) * | 2016-08-10 | 2017-01-11 | 北京理工大学 | Improved-Fisher-based chemical process fault diagnosis method |
CN108490204A (en) * | 2011-09-25 | 2018-09-04 | 赛拉诺斯知识产权有限责任公司 | System and method for multiple analysis |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10685738B1 (en) * | 2017-09-19 | 2020-06-16 | Quantigic Genomics LLC | Cancer diagnostic tool using cancer genomic signatures to determine cancer type |
AU2020248338A1 (en) * | 2019-03-28 | 2021-11-18 | Phase Genomics, Inc. | Systems and methods for karyotyping by sequencing |
WO2020234729A1 (en) * | 2019-05-17 | 2020-11-26 | Insilico Medicine Ip Limited | Deep proteome markers of human biological aging and methods of determining a biological aging clock |
WO2022058980A1 (en) * | 2020-09-21 | 2022-03-24 | Insilico Medicine Ip Limited | Methylation data signatures of aging and methods of determining a methylation aging clock |
CN112270952B (en) * | 2020-10-30 | 2022-04-05 | 广西师范大学 | A method to identify cancer driver pathways |
CN114023383A (en) * | 2021-11-04 | 2022-02-08 | 广西师范大学 | Non-parameter nonlinear intelligent optimization method for identifying cancer drive path |
CN115359839A (en) * | 2022-08-17 | 2022-11-18 | 广西师范大学 | A CPGA-SMCMN approach to identify single driver pathways in cancer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||