
CN115762646B - A pan-cancer common driver pathway identification method based on GAN sample balance - Google Patents

Info

Publication number
CN115762646B
Authority
CN
China
Prior art keywords
matrix
gene
cancer
chromosome
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211581374.8A
Other languages
Chinese (zh)
Other versions
CN115762646A (en
Inventor
欧阳扬
吴璟莉
李高仕
朱凯
龚艳霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University
Priority to CN202211581374.8A
Publication of CN115762646A
Application granted
Publication of CN115762646B
Legal status: Active
Anticipated expiration

Links

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a pan-cancer common driver pathway identification method based on GAN sample balance, comprising the following steps: 1) generating somatic mutation data for the corresponding cancer that conforms to the real data distribution; 2) constructing the model CDP-HA, which minimizes the dispersion among the total weights of the individual cancers; and 3) introducing a single-parent genetic algorithm (parthenogenetic algorithm) to solve the model CDP-HA. The method is a useful tool for identifying pan-cancer common driver pathways and has strong extensibility and practicability.

Description

Pan-cancer common driver pathway identification method based on GAN sample balance
Technical Field
The invention relates to the field of bioinformatics and is used for identifying cancer driver pathways; in particular, it is a pan-cancer common driver pathway identification method based on GAN sample balance.
Background
Cancer is a complex disease that threatens human health, and its etiology involves a variety of genetic and environmental factors. Understanding the mechanism of carcinogenesis at the molecular level is a great challenge, and doing so facilitates the diagnosis, treatment and drug design of cancers in medicine. With the rapid development of next-generation sequencing (NGS) technologies, researchers can better characterize cancer at the molecular level. Several large cancer genome projects, including The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC) and the Cancer Cell Line Encyclopedia (CCLE), have generated and analyzed vast amounts of data, providing unprecedented opportunities for further understanding the molecular and oncogenic mechanisms of cancer. Previous studies have shown that only driver mutations promote cancer progression, while passenger mutations have little impact on it. Distinguishing driver mutations from passenger mutations has therefore become an important task in studying cancer pathogenesis.
Early studies largely focused on designing methods that can effectively identify individual driver genes with significantly higher mutation rates. However, mutations are highly heterogeneous: within the same cancer, mutations occur in different driver genes. Thus, identifying single driver genes is not sufficient for understanding the mechanism of cancer progression. Further studies have shown that cancer is often caused by the disruption of certain pathways (cell signaling or regulatory pathways), which may be perturbed by different combinations of driver mutations. Identifying driver pathways is therefore a key route to understanding carcinogenic mechanisms at the pathway level. Currently, the driver pathway identification problem can be divided into three directions: identifying a single driver pathway, identifying cooperative driver pathways, and identifying pan-cancer common and specific driver pathways. The present invention mainly studies the pan-cancer common driver pathway identification problem.
Identifying common driver pathways at the pan-cancer scale aims to investigate the commonalities that may exist between different cancer types, which helps deepen the understanding of cancer pathogenesis. The TCGA Pan-Cancer project has collected multi-platform mutation data from thousands of cancer patients across 12 cancer types, providing opportunities for further investigation of this problem. Recently, a class of research methods based on prior knowledge has been proposed; these generally use gene-gene interaction (GGI) networks, protein-protein interaction (PPI) networks, and pathway-pathway interaction (PaPaI) networks. Although they can achieve good identification results, relying on prior knowledge may, on the one hand, miss better combinations of mutated genes and, on the other hand, limit the scope of the pathway search, because existing prior knowledge is imperfect and contains only part of the pathway information. For example, Leiserson et al. proposed the HotNet method based on a directed heat diffusion model, which tries to obtain pathways and protein complexes by combining protein-protein interaction network analysis. Kim et al. studied different types of mutual exclusivity among various cancer types and proposed the MEMCover method for identifying subnetworks/pathways based on the HumanNet network. Hajkarim et al. proposed the DAMOKLE algorithm based on a large gene-gene interaction network, which attempts to identify subnetworks whose mutation frequencies differ significantly between the samples of two cancers. The other class consists of de novo identification methods. Zhang et al. proposed the ComMDP method, which exploits two characteristics of driver pathways, high mutual exclusivity and high coverage, and directly extends the maximum weight submatrix problem model for a single cancer to multiple cancer types, attempting to identify common driver pathways by accumulating absolute weight values. Wu et al. introduced the CDP-V model, which uses relative proportions instead of absolute numbers and uses the variance to minimize the dispersion of the proportions, and also proposed the CDP-H model, which uses the harmonic mean to minimize the dispersion of the proportions and thereby reduces the number of parameters; both attempt to identify common driver pathways through relative weight values.
Among the above methods, ComMDP uses absolute weight values and obtains the gene set with the maximum weight by accumulating the absolute weight values of the individual cancer types, without considering the imbalance of sample sizes among cancer types. When the sample sizes differ greatly, the identification result is therefore biased toward the cancer types with larger sample sizes, and some driver pathways with higher commonality may be missed. The PGA-V method addresses the imbalance by calculating relative weights, but it introduces a manually set parameter whose determination requires a large number of tedious experiments, which hinders extension to practical applications. Moreover, the sample size imbalance itself is not actually resolved.
With the rapid development of deep learning, the problem of data imbalance has become increasingly pressing. In 2012, a data enhancement strategy was proposed that generates additional data items fitting the distribution of the real data by transforming existing real data. However, no better data generation method was available at that time, until Goodfellow et al. proposed in 2014 a powerful generative model based on game theory, the generative adversarial network (GAN). Although a large number of deep-learning-based generative models had been created earlier, GANs are among the most successful generative models and have been successfully applied to data enhancement in various fields.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a pan-cancer common driver pathway identification method based on GAN sample balance, which is a useful tool for identifying pan-cancer common driver pathways and has strong extensibility and practicability.
The technical scheme for achieving the aim of the invention is as follows:
A pan-cancer common driver pathway identification method based on GAN sample balance comprises the following steps:
1) Generating somatic mutation data for the corresponding cancer that conforms to the real data distribution:
1.1) Setting up the generative adversarial network framework:
Assuming an example training set with m_r samples and n_r genes, the generator network of SNV-GANs is defined as G(z), the input of the generator is z ~ N(0, 1), and the generator is defined as follows:
1.1.1) The input layer maps the noise vector z with GFC1 into a tensor zn of dimension (1, 128);
1.1.2) The hidden layer maps the tensor zn of step 1.1.1) with GFC2, maps the result with GFC3 and likewise with GFC4, finally yielding a tensor zn' of dimension (1, 1024);
1.1.3) The output layer maps the tensor zn' to a tensor gn of dimension (1, m_r*n_r) via GFC5 and reshapes gn into a tensor of dimension (m_r, n_r), which is the output of the generator;
The input layer and the hidden layer both use the dropout function to deactivate part of the neurons and adopt the activation function ReLU defined by formula (1), and the output layer uses the activation function Sigmoid defined by formula (2). The discriminator network is defined as D(x), and the input of the discriminator is real data x ~ P_real or generated data G(z), where x represents a set of somatic mutation data samples. The discriminator is defined as follows:
1.1.4) The input layer maps x with DFC1 to a tensor xn of dimension (m_r, 256);
1.1.5) The hidden layer maps the tensor xn of step 1.1.4) with DFC2, maps the result with DFC3 and likewise with DFC4, finally yielding a tensor xn' of dimension (m_r, 16);
1.1.6) The output layer maps the tensor xn' to a tensor dn of dimension (m_r, 1), i.e. the output of the discriminator, via DFC5, where both the input layer and the hidden layer use the activation function ReLU defined by formula (1) and the output layer uses the activation function Sigmoid defined by formula (2);
ReLU(x) = max(0, x) (1),
Sigmoid(x) = 1 / (1 + e^(-x)) (2);
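The layer chain described in steps 1.1.1) to 1.1.6) can be traced with a short Python sketch. It only follows tensor shapes and the two activation functions; it is not the patent's implementation. The widths 128 and 1024 (generator) and 256 and 16 (discriminator) come from the text, while the intermediate widths of GFC2, GFC3, DFC2 and DFC3 are not stated and are placeholders chosen here for illustration. Formula (2) is assumed to be the standard logistic function.

```python
import math


def relu(x):
    """Formula (1): ReLU(x) = max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Standard logistic function, assumed here to be formula (2)."""
    return 1.0 / (1.0 + math.exp(-x))


def generator_shapes(m_r, n_r, noise_dim=100, hidden=(128, 256, 512, 1024)):
    """Trace tensor shapes through GFC1..GFC5 for one noise vector.

    hidden[0] = 128 and hidden[-1] = 1024 come from the text; the two
    middle widths are illustrative assumptions.
    """
    shapes = [(1, noise_dim)]
    for width in hidden:               # GFC1 .. GFC4
        shapes.append((1, width))
    shapes.append((1, m_r * n_r))      # GFC5
    shapes.append((m_r, n_r))          # reshape to samples x genes
    return shapes


def discriminator_shapes(m_r, n_r, hidden=(256, 64, 32, 16)):
    """Trace tensor shapes through DFC1..DFC5 for an (m_r, n_r) input.

    hidden[0] = 256 and hidden[-1] = 16 come from the text; the two
    middle widths are illustrative assumptions.
    """
    shapes = [(m_r, n_r)]
    for width in hidden:               # DFC1 .. DFC4
        shapes.append((m_r, width))
    shapes.append((m_r, 1))            # DFC5: one probability per sample
    return shapes


if __name__ == "__main__":
    print(generator_shapes(95, 211))
    print(discriminator_shapes(95, 211))
```

With m_r = 95 and n_r = 211, the generator's last fully connected layer has 95 × 211 = 20045 outputs before the reshape.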
1.2) Training process of SNV-GANs:
1.2.1) Given a somatic mutation matrix A_r (m_r × n_r) and a proportion parameter ∝ (∝ < 1) for randomly extracting samples, randomly extract from the matrix A_r a submatrix M_r with m samples, m = m_r × ∝, according to the proportion parameter ∝; extract 64 submatrices M_r in total to construct a training set X, and input the training set X into the generative adversarial network for training;
1.2.2) Initialize the parameters θ_d of the discriminator D() and the parameters θ_g of the generator G();
1.2.3) Let the current round epoch = 1, and randomly generate a 1 × 100 Gaussian-distributed noise vector z;
1.2.4) Use z of step 1.2.3) as the input of the generator to obtain generated data of size m × n_r;
1.2.5) Calculate the generator loss value according to formula (3), and then update the parameters θ_g of the generator, where G(z^(i)) represents the data generated by the generator from the noise vector z^(i), and D(G(z^(i))) represents the probability that the discriminator judges the generated data to be real data; the smaller loss_G is, the better;
1.2.6) Randomly extract a sample set x from the training set X;
1.2.7) Randomly generate a 1 × 100 Gaussian-distributed noise vector z;
1.2.8) Use z of step 1.2.7) as the input of the generator to obtain generated data of size m × n_r;
1.2.9) Calculate the discriminator loss value according to formula (4), and then update the parameters θ_d of the discriminator, where D(x^(i)) represents the probability that the discriminator judges the real data to be real data, and 1 - D(G(z^(i))) represents the probability that the discriminator judges the generated data to be generated data; the larger loss_D is, the better;
1.2.10) Judge whether the current round epoch has reached the set maximum number of rounds; if so, stop training, otherwise return to step 1.2.3); a trained generator G() is finally obtained;
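Formulas (3) and (4) do not appear in the text above. The following Python sketch assumes they are the standard GAN objectives of Goodfellow et al., which agree with the surrounding descriptions: the generator loss is built from D(G(z^(i))) and should be small, and the discriminator loss is built from D(x^(i)) and 1 - D(G(z^(i))) and should be large. The function names and the clamping constant `EPS` are illustrative choices, not part of the patent.

```python
import math

EPS = 1e-12  # assumption: keeps log() away from 0 for saturated outputs


def _clamp(p):
    """Restrict a probability to the open interval (0, 1)."""
    return min(max(p, EPS), 1.0 - EPS)


def generator_loss(d_fake):
    """Assumed form of formula (3): mean of log(1 - D(G(z_i))).

    d_fake holds the discriminator outputs on generated data.
    Smaller is better for the generator.
    """
    return sum(math.log(1.0 - _clamp(p)) for p in d_fake) / len(d_fake)


def discriminator_loss(d_real, d_fake):
    """Assumed form of formula (4): mean of log D(x_i) + log(1 - D(G(z_i))).

    Larger is better for the discriminator.
    """
    if len(d_real) != len(d_fake):
        raise ValueError("real and generated batches must have equal size")
    total = sum(
        math.log(_clamp(pr)) + math.log(1.0 - _clamp(pf))
        for pr, pf in zip(d_real, d_fake)
    )
    return total / len(d_real)


if __name__ == "__main__":
    # A discriminator that is fooled (outputs near 1 on generated data)
    # gives the generator a lower loss than one that is not fooled.
    print(generator_loss([0.9, 0.8]), generator_loss([0.1, 0.2]))
    print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))
```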
1.3) Data processing:
1.3.1) Randomly generate a 1 × 100 Gaussian-distributed noise vector z;
1.3.2) Input the vector z of step 1.3.1) into the trained generator G() to obtain the generated data G_data = G(z);
1.3.3) Set the values of G_data that are greater than or equal to 0.85 to 1 and the values less than 0.85 to 0 to obtain a new binary matrix A_fakedata;
1.3.4) Take the largest sample size among the somatic mutation matrices of the R cancers as m_max (max being the index of the cancer with the most samples), and insert the matrix A_r into an amplification matrix; at this point the number of samples in the amplification matrix is m_r;
1.3.5) If the number of samples still to be amplified is greater than the sample number m_r of the matrix A_r, perform step 1.3.6); if it is smaller than the sample number m_r of the matrix A_r, perform step 1.3.8);
1.3.6) Randomly extract m_r samples from A_fakedata of step 1.3.3) to form an extracted matrix, and calculate the mutation rate of each gene in the extracted matrix and in the matrix A_r, obtaining the two corresponding mutation probability sets V and Q;
1.3.7) Input the two sets V and Q obtained in step 1.3.6) into the JS divergence formula defined by formula (5) to obtain a distribution value; the smaller the distribution value, the more similar the mutation rates of the extracted matrix and the matrix A_r. If the distribution value is less than 0.09, insert the extracted matrix into the amplification matrix and update its number of samples; otherwise, repeat step 1.3.6);
1.3.8) Randomly extract one sample from A_fakedata of step 1.3.3), add it directly to the amplification matrix, and update its number of samples;
1.3.9) If the number of samples in the current amplification matrix equals the maximum sample size m_max, sample supplementation ends; otherwise, return to step 1.3.5). An amplification matrix whose sample size equals m_max is finally obtained, and the matrix A_r is then set to this amplification matrix;
2) Constructing the model CDP-HA, which minimizes the dispersion among the total weights of the individual cancers:
There are R (R ≥ 2) cancer types. For each cancer type, a binary somatic mutation matrix A_r = (a^r_ij) with m_r rows and n_r columns records whether a gene is mutated in a sample; the rows represent samples (patients), the columns represent genes, and r = 1, 2, 3, ..., R. a_i- represents the i-th sample in the matrix A_r, and a_-j represents the j-th gene in the matrix A_r. When the j-th gene of the i-th sample is mutated in the mutation matrix of the r-th cancer, a^r_ij = 1; otherwise a^r_ij = 0. Given a gene set S of size k, consider the corresponding submatrix of size m_r × k in the matrix A_r; for a gene g in S, the samples of this submatrix in which g is mutated are the samples covered by g. The total number of samples covered by the genes of S measures the coverage of the gene set S, and the coverage overlap, i.e. the number of times samples are covered repeatedly, measures the mutual exclusivity of the gene set S;
According to the above notation and problem definition, a nonlinear maximization weight function model CDP-HA is constructed: given the binary somatic mutation matrices A_r (m_r rows, n_r columns) of R cancer types and a parameter K, let W_C(S) be the weight function to be maximized, and determine an m × K submatrix that maximizes W_C(S), as specified by formula (6),
where the per-cancer term represents the absolute weight value of the gene set S in the r-th cancer type;
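Formula (6) is not legible in the text above, so the following Python sketch rests on two assumptions. First, the absolute weight of a gene set in one cancer is taken to be coverage minus coverage overlap, as in the maximum weight submatrix problem that the Background attributes to ComMDP. Second, the per-cancer weights are combined with a harmonic mean, following the later description of CDP-HA as a model that aggregates absolute weight values using the harmonic mean. Returning 0 when any weight is non-positive is a further illustrative choice. Gene indices are 0-based column indices here.

```python
def coverage(matrix, genes):
    """Number of samples (rows) mutated in at least one gene of the set."""
    return sum(1 for row in matrix if any(row[g] for g in genes))


def coverage_overlap(matrix, genes):
    """Mutations in the gene set beyond the first one per covered sample."""
    total_mutations = sum(row[g] for row in matrix for g in genes)
    return total_mutations - coverage(matrix, genes)


def absolute_weight(matrix, genes):
    """Assumed absolute weight: coverage minus coverage overlap."""
    return coverage(matrix, genes) - coverage_overlap(matrix, genes)


def harmonic_mean_weight(matrices, genes):
    """Assumed aggregation: harmonic mean of the per-cancer weights.

    Returns 0.0 if any weight is non-positive, because the harmonic
    mean is only meaningful for positive values (illustrative choice).
    """
    weights = [absolute_weight(m, genes) for m in matrices]
    if any(w <= 0 for w in weights):
        return 0.0
    return len(weights) / sum(1.0 / w for w in weights)


if __name__ == "__main__":
    cancer_1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
    cancer_2 = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 1]]
    gene_set = [0, 1]
    print(absolute_weight(cancer_1, gene_set))
    print(harmonic_mean_weight([cancer_1, cancer_2], gene_set))
```

The harmonic mean is dominated by its smallest term, so a gene set scores well only if its weight is reasonably high in every cancer type, which matches the stated goal of minimizing dispersion among the per-cancer weights.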
3) Introducing a single-parent genetic algorithm to solve the model CDP-HA:
3.1) Setting the fitness function:
Given a chromosome E, let M_E denote the submatrix corresponding to the chromosome, of size m × K; the fitness function Fitness(E) is defined by formula (7), and the larger the fitness value, the better the feasible solution;
Fitness(E) = W_C(M_E) (7);
3.2) Setting the selection operator:
Roulette selection combined with an elite strategy is used to generate the new generation: the individual with the highest fitness is inherited directly from the parent generation by the offspring generation, and the roulette selection operator then generates the remaining N-1 individuals;
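The selection operator can be sketched in Python as follows. The function name `select_next_generation` is hypothetical, and shifting the fitness values so that all roulette weights are positive is an assumption added here, since the text does not say how non-positive fitness values are handled.

```python
import random


def select_next_generation(population, fitness, rng=random):
    """Elitism plus roulette-wheel selection.

    The fittest individual is copied unchanged, and the remaining
    N - 1 individuals are drawn with probability proportional to
    (shifted) fitness, with replacement.
    """
    n = len(population)
    scores = [fitness(ind) for ind in population]
    best_index = max(range(n), key=scores.__getitem__)
    next_generation = [population[best_index]]      # elite strategy

    # Assumption: shift scores so every roulette weight is positive.
    lowest = min(scores)
    weights = [s - lowest + 1e-9 for s in scores]
    next_generation.extend(rng.choices(population, weights=weights, k=n - 1))
    return next_generation


if __name__ == "__main__":
    pop = [(1, 2, 3), (4, 5, 6), (7, 8, 9), (2, 4, 6)]
    print(select_next_generation(pop, fitness=sum, rng=random.Random(0)))
```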
3.3) Setting the recombination operator:
A recombination operator based on a greedy strategy is adopted, with the following steps: first, given a parent chromosome E = {e_1, e_2, ..., e_k} (e_i = 1, 2, ..., n), where e_i represents a gene number (so E is also referred to as a gene set), a candidate gene set is determined; then one gene is randomly deleted from the gene set E to obtain a reduced gene set; finally, the optimal gene is selected from the candidate set based on the greedy strategy and added to the reduced gene set to generate the final new offspring;
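A minimal Python sketch of this greedy recombination is given below. The definition of the candidate set is lost in the text, so the sketch assumes it consists of all genes 1..n that are not in the parent chromosome; "optimal" is taken to mean the candidate that gives the highest fitness when added to the reduced gene set. The function name and signature are illustrative.

```python
import random


def greedy_recombine(parent, n_genes, fitness, rng=random):
    """Delete one random gene, then greedily add the best candidate gene.

    parent  : list of k distinct gene numbers in 1..n_genes
    fitness : callable scoring a list of gene numbers (higher is better)
    Assumption: candidates are the genes not present in the parent.
    """
    parent = list(parent)
    candidates = [g for g in range(1, n_genes + 1) if g not in parent]
    if not candidates:
        raise ValueError("no candidate genes outside the parent chromosome")

    removed = rng.randrange(len(parent))
    reduced = parent[:removed] + parent[removed + 1:]

    # Greedy step: evaluate each candidate in place of the deleted gene.
    best_gene = max(candidates, key=lambda g: fitness(reduced + [g]))
    return reduced + [best_gene]


if __name__ == "__main__":
    # Toy fitness (sum of gene numbers) just to exercise the operator.
    print(greedy_recombine([1, 2, 3], 6, fitness=sum, rng=random.Random(1)))
```

Because every candidate is scored once, one recombination costs n - k fitness evaluations, which is the price of the greedy choice compared with a purely random mutation.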
3.4) Setting parameters:
Input the enhanced somatic mutation matrices A_r of the R cancers, the number of genes g_number, the parameter k, the population size N, the number of algorithm executions t, and the maximum number of generations maxg;
3.5) Constructing the initial population:
The chromosomes are encoded in decimal; one chromosome represents an individual and represents a solution vector of the problem. In the single-parent genetic algorithm, a set of K genes serves as a solution of the problem, i.e. E = {e_1, e_2, ..., e_k} (e_i = 1, 2, ..., n). Individuals in the population are initialized as follows: a random sequence of the natural numbers 1 to n is generated, each number representing one gene in the mutation matrix, and the n genes are grouped in order to obtain n/K gene sets S_1, S_2, ..., S_n/K. Let S_max denote the gene set with the largest fitness value among them; the genes of S_max form an initial chromosome. N initial chromosomes are generated to form the initial population pop_0, with the first K numbers selected as an initial chromosome. The fitness values of the chromosomes in pop_0 are calculated and compared, the best individual is stored in the variable best, and the initial iteration number is set to step = 0;
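One reading of this initialization can be sketched in Python: shuffle the gene numbers, cut the shuffled sequence into consecutive groups of K, and keep the group with the highest fitness as one initial chromosome. The text is ambiguous on two points, so the sketch makes assumptions: leftover genes that do not fill a complete group are dropped, and S_max is taken to be the group with the highest fitness. Function names are illustrative.

```python
import random


def initial_chromosome(n_genes, k, fitness, rng=random):
    """Build one initial chromosome from a random grouping of the genes.

    Assumptions: incomplete trailing groups are discarded, and the
    retained group is the one with the highest fitness.
    """
    order = list(range(1, n_genes + 1))
    rng.shuffle(order)
    groups = [order[i:i + k] for i in range(0, n_genes - k + 1, k)]
    return max(groups, key=fitness)


def initial_population(n_genes, k, size, fitness, rng=random):
    """Generate `size` initial chromosomes, each from a fresh shuffle."""
    return [initial_chromosome(n_genes, k, fitness, rng) for _ in range(size)]


if __name__ == "__main__":
    rng = random.Random(0)
    pop0 = initial_population(n_genes=20, k=3, size=5, fitness=sum, rng=rng)
    best = max(pop0, key=sum)  # toy fitness: sum of gene numbers
    print(pop0, best)
```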
3.6) Performing the iterative operation:
3.6.1) If step > maxg, go to step 3.6.5) to obtain the common driver pathway of size K; otherwise go to step 3.6.2);
3.6.2) For the population pop_step, first put the chromosome best with the highest fitness value in pop_step into pop_step+1, and then execute the roulette selection operator to select the remaining N-1 chromosomes and put them into pop_step+1;
3.6.3) If step < 0.7 × maxg or Fitness(E') > Fitness(E), update the chromosome E = E'; otherwise do not update and retain E; then set step = step + 1;
3.6.4) Take the chromosome with the highest fitness value in pop_step+1; if its fitness value is larger than that of the chromosome best, update best to this chromosome;
3.6.5) Convert the chromosome best into a gene set to obtain the submatrix M, and output M; the output M is the common driver pathway S of size K.
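The acceptance rule of step 3.6.3) can be isolated in a small Python sketch. It assumes that E' denotes the offspring produced from E by the recombination operator of step 3.3), which the text does not state explicitly; the function name is illustrative.

```python
def accept_offspring(step, maxg, fitness_child, fitness_parent):
    """Decide whether offspring E' replaces parent E (step 3.6.3).

    During the first 70% of the generations every offspring is accepted;
    afterwards only strictly fitter offspring are accepted.
    """
    return step < 0.7 * maxg or fitness_child > fitness_parent


if __name__ == "__main__":
    maxg = 1000
    print(accept_offspring(10, maxg, fitness_child=1, fitness_parent=5))
    print(accept_offspring(800, maxg, fitness_child=1, fitness_parent=5))
    print(accept_offspring(800, maxg, fitness_child=6, fitness_parent=5))
```

Read this way, the rule lets the search move freely (including to worse chromosomes) in the early generations and switches to strict improvement in the last 30% of the run.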
In this technical scheme, a generative adversarial network (GAN) is used to generate samples of somatic mutation data for cancers with few samples, so that the sample sizes of the cancer types are balanced; a mathematical model CDP-HA, which aggregates the absolute weight values of the individual cancer types using the harmonic mean, is then constructed; and finally a single-parent genetic algorithm is used to solve the model. For multiple cancer types, the method can effectively supplement cancers with few samples and resolves the difference in sample sizes. Meanwhile, the gene set identified by the proposed model is not only mutated in most samples of these cancers, but the proportions of mutated samples in the individual cancers are also very close. In addition, the method detects biologically meaningful gene sets that other methods miss. The generative adversarial network thus reduces the differences in sample size between different cancer types and provides a new approach for identifying pan-cancer common driver pathways.
Compared with the prior art, the technical scheme has the following advantages:
(1) A nonlinear weight function that minimizes dispersion while maximizing weight is designed to measure the relative weights of multiple cancer types.
(2) The data enhancement method SNV-GANs, which is suited to cancer somatic mutation data, has high practical value, and this is the first application of GANs to cancer somatic mutation data.
(3) The pan-cancer common driver pathways found by the overall technical scheme contain more genes enriched in the same important signaling pathway, and the identified genes are enriched in more important signaling pathways.
The method can serve both as a useful tool for enhancing cancer somatic mutation data and as a useful tool for identifying pan-cancer common driver pathways. It can provide more biological information, has strong extensibility and practicability, can identify more genes enriched in important signaling pathways, and can identify genes enriched in more important signaling pathways.
Drawings
FIG. 1 is an example diagram of a common driver pathway in the embodiment;
FIG. 2 is an example diagram of the single-parent genetic algorithm in the embodiment;
FIG. 3 is the network model of SNV-GANs in the embodiment;
FIG. 4 is a schematic diagram of the training process of SNV-GANs in the embodiment;
FIG. 5 is the pseudocode of the SNV-GANs training process in the embodiment;
FIG. 6 is an example diagram of the actual effect of the generated data in the embodiment.
Detailed Description
The invention is described in further detail below with reference to the drawings and a specific embodiment, which are not intended to limit the invention.
Embodiment:
Step 1) of the experiment is carried out on a Linux server (Intel(R) Xeon(R) Gold 6230 2.10 GHz CPU, 256 GB of memory, 32 GB of video memory), and the runtime environment is Python 3.7.9. Steps 2) and 3) are carried out on a computer (Intel(R) Core(TM) i5-6500 3.20 GHz CPU, 32 GB of memory) whose operating system is Windows 10; the development tool is Eclipse 4.23 and the runtime environment is Java 1.8.0.
This embodiment addresses the pan-cancer common driver pathway identification problem.
A pan-cancer common driver pathway identification method based on GAN sample balance comprises the following steps:
Somatic mutation data A_r of three cancer types are assumed: A_1 (COADCORE, 95 samples, 211 genes), A_2 (BLCA, 95 samples, 211 genes) and A_3 (LUAD, 95 samples, 211 genes).
1) Generating somatic mutation data for the corresponding cancer that conforms to the real data distribution:
1.1) Setting up the generative adversarial network framework:
Taking the BLCA somatic mutation data A_2 as the example training set, the generator network of SNV-GANs is defined as G(z), the input of the generator is z ~ N(0, 1), and the generator is defined as follows, as shown in FIG. 3:
1.1.1) The input layer maps the noise vector z with GFC1 into a tensor zn of dimension (1, 128);
1.1.2) The hidden layer maps the tensor zn of step 1.1.1) with GFC2, maps the result with GFC3 and likewise with GFC4, finally yielding a tensor zn' of dimension (1, 1024);
1.1.3) The output layer maps the tensor zn' to a tensor gn of dimension (1, 95 × 211) via GFC5 and reshapes gn into a tensor of dimension (95, 211), which is the output of the generator;
The input layer and the hidden layer both use the dropout function to deactivate part of the neurons and adopt the activation function ReLU defined by formula (1), and the output layer uses the activation function Sigmoid defined by formula (2). The discriminator network is defined as D(x), and the input of the discriminator is real data x ~ P_real or generated data G(z), where x represents a set of somatic mutation data samples. The discriminator is defined as follows, as shown in FIG. 3:
1.1.4) The input layer maps x with DFC1 to a tensor xn of dimension (95, 256);
1.1.5) The hidden layer maps the tensor xn of step 1.1.4) with DFC2, maps the result with DFC3 and likewise with DFC4, finally yielding a tensor xn' of dimension (95, 16);
1.1.6) The output layer maps the tensor xn' to a tensor dn of dimension (95, 1), i.e. the output of the discriminator, via DFC5, where both the input layer and the hidden layer use the activation function ReLU defined by formula (1) and the output layer uses the activation function Sigmoid defined by formula (2);
ReLU(x) = max(0, x) (1),
Sigmoid(x) = 1 / (1 + e^(-x)) (2);
1.2) Training process of SNV-GANs:
1.2.1) Given the somatic mutation matrix A_2 (95 × 211) and the proportion parameter ∝ = 0.7 for randomly extracting samples, randomly extract from the matrix A_2 a submatrix M_2 with m = 95 × 0.7 ≈ 67 samples according to the proportion parameter ∝; extract 64 submatrices M_2 in total to construct a training set X, and input the training set X into the generative adversarial network, as shown in FIG. 4;
1.2.2) Initialize the parameters θ_d of the discriminator D() and the parameters θ_g of the generator G();
1.2.3) Let the current round epoch = 1, and randomly generate a 1 × 100 Gaussian-distributed noise vector z;
1.2.4) Use z of step 1.2.3) as the input of the generator to obtain generated data of size 67 × 211;
1.2.5) Calculate the generator loss value according to formula (3), and then update the parameters θ_g of the generator, where G(z^(i)) represents the data generated by the generator from the noise vector z^(i), and D(G(z^(i))) represents the probability that the discriminator judges the generated data to be real data; the smaller loss_G is, the better;
1.2.6) Randomly extract a sample set x from the training set X;
1.2.7) Randomly generate a 1 × 100 Gaussian-distributed noise vector z;
1.2.8) Use z of step 1.2.7) as the input of the generator to obtain generated data of size 95 × 211;
1.2.9) Calculate the discriminator loss value according to formula (4), and then update the parameters θ_d of the discriminator, where D(x^(i)) represents the probability that the discriminator judges the real data to be real data, and 1 - D(G(z^(i))) represents the probability that the discriminator judges the generated data to be generated data; the larger loss_D is, the better;
1.2.10) Judge whether the current round epoch has reached 10000; if so, stop training, otherwise return to step 1.2.3); a trained generator G() is finally obtained;
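The construction of the training set in step 1.2.1) can be sketched in Python. The text gives m = 95 × 0.7 ≈ 67; since 95 × 0.7 = 66.5, the sketch rounds up, which is an assumption chosen to reproduce the value 67. Sampling rows without replacement within each submatrix is also an assumption, and the function name is illustrative.

```python
import math
import random


def build_training_set(matrix, proportion, n_submatrices=64, rng=random):
    """Draw random row subsets of a mutation matrix (step 1.2.1).

    matrix     : list of rows (samples), each a list of 0/1 gene values
    proportion : fraction of samples per submatrix, below 1
    Assumptions: m is rounded up; rows are drawn without replacement.
    """
    m = math.ceil(len(matrix) * proportion)
    return [rng.sample(matrix, m) for _ in range(n_submatrices)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder 95 x 211 matrix standing in for the BLCA data A_2.
    a_2 = [[rng.randint(0, 1) for _ in range(211)] for _ in range(95)]
    training_set = build_training_set(a_2, 0.7, rng=rng)
    print(len(training_set), len(training_set[0]), len(training_set[0][0]))
```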
1.3) Data processing:
1.3.1) Randomly generate a 1 × 100 Gaussian-distributed noise vector z;
1.3.2) Input the vector z of step 1.3.1) into the trained generator G() to obtain the generated data G_data = G(z);
1.3.3) Set the values of G_data that are greater than or equal to 0.85 to 1 and the values less than 0.85 to 0 to obtain a new binary matrix A_fakedata;
1.3.4) Take the largest sample size among the 3 cancer somatic mutation matrices, namely m_max = m_1 = 489, and insert the matrix A_2 into an amplification matrix; at this point the number of samples in the amplification matrix is m_2 = 95;
1.3.5) If the number of samples still to be amplified is greater than the sample number m_2 = 95 of the matrix A_2, perform step 1.3.6); if it is smaller than the sample number m_2 = 95 of the matrix A_2, perform step 1.3.8);
1.3.6) Randomly extract m_2 samples from A_fakedata of step 1.3.3) to form an extracted matrix, and calculate the mutation rate of each gene in the extracted matrix and in the matrix A_2, obtaining the two corresponding mutation probability sets V and Q;
1.3.7) Input the two sets V and Q obtained in step 1.3.6) into the JS divergence formula defined by formula (5) to obtain a distribution value; the smaller the distribution value, the more similar the mutation rates of the extracted matrix and the matrix A_2. If the distribution value is less than 0.09, insert the extracted matrix into the amplification matrix and update its number of samples; otherwise, repeat step 1.3.6);
1.3.8) Randomly extract one sample from A_fakedata of step 1.3.3), add it directly to the amplification matrix, and update its number of samples;
1.3.9) If the number of samples in the current amplification matrix equals the maximum sample size m_max = 489, sample supplementation ends; otherwise, return to step 1.3.5). An amplification matrix whose sample size equals m_max = 489 is finally obtained. The mutation rates of the new amplification matrix are substantially identical to those of the original matrix A_2, and the matrix A_2 is then set to this amplification matrix. The pseudocode of the training process is shown in FIG. 5;
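The thresholding of step 1.3.3) and the acceptance test of steps 1.3.6) and 1.3.7) can be sketched in Python. Formula (5) is not shown in the text, so the sketch assumes the usual Jensen-Shannon divergence with natural logarithms; normalizing the per-gene mutation rates so that each vector sums to one is a further assumption, as are the function names.

```python
import math


def binarize(generated, threshold=0.85):
    """Step 1.3.3: values >= threshold become 1, all others 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in generated]


def mutation_rates(matrix):
    """Fraction of samples (rows) in which each gene (column) is mutated."""
    m = len(matrix)
    return [sum(column) / m for column in zip(*matrix)]


def js_divergence(v, q):
    """Jensen-Shannon divergence, assumed to be formula (5).

    Assumption: v and q are first normalized to sum to one so that
    they can be treated as probability distributions.
    """
    def normalize(x):
        total = sum(x)
        if total <= 0:
            raise ValueError("mutation rates must not all be zero")
        return [xi / total for xi in x]

    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    p, r = normalize(v), normalize(q)
    mid = [(pi + ri) / 2 for pi, ri in zip(p, r)]
    return 0.5 * kl(p, mid) + 0.5 * kl(r, mid)


def accept_batch(candidate, real, limit=0.09):
    """Step 1.3.7: accept a generated batch whose divergence is below 0.09."""
    return js_divergence(mutation_rates(candidate), mutation_rates(real)) < limit


if __name__ == "__main__":
    fake = binarize([[0.91, 0.10, 0.86], [0.20, 0.88, 0.30]])
    real = [[1, 0, 1], [0, 1, 1]]
    print(fake, js_divergence(mutation_rates(fake), mutation_rates(real)))
```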
2) Constructing the model CDP-HA, which minimizes the dispersion among the total weights of the individual cancers:
There are R = 3 cancer types. For each cancer type, a binary somatic mutation matrix A_r = (a^r_ij) with m_r rows and n_r columns records whether a gene is mutated in a sample; the rows represent samples (patients), the columns represent genes, and r = 1, 2, 3, ..., R. a_i- represents the i-th sample in the matrix A_r, and a_-j represents the j-th gene in the matrix A_r. When the j-th gene of the i-th sample is mutated in the mutation matrix of the r-th cancer, a^r_ij = 1; otherwise a^r_ij = 0. Given a gene set S of size k, consider the corresponding submatrix of size m_r × k in the matrix A_r; for a gene g in S, the samples of this submatrix in which g is mutated are the samples covered by g. The total number of samples covered by the genes of S measures the coverage of the gene set S, and the coverage overlap, i.e. the number of times samples are covered repeatedly, measures the mutual exclusivity of the gene set S;
According to the above definitions, a nonlinear maximization weight function model CDP-HA is constructed: given the binary somatic mutation matrices A_r (m_r rows, n_r columns) of R cancer types and a parameter K (0 < K ≤ 10), let W_C(S) be the weight function to be maximized, and determine an m × K submatrix that maximizes W_C(S), as specified by formula (6),
where the per-cancer term represents the absolute weight value of the gene set S in the r-th cancer type. FIG. 1 shows the mutation matrices of three cancers with two submatrices S_1 and S_2 of size K = 3; computing their weights with the model proposed in this embodiment shows that the weight of S_2 is higher, so S_2 better matches the commonality required of a common driver pathway;
3) Introduce a parthenogenetic (single-parent genetic) algorithm to solve the model CDP-HA:
3.1) Set the fitness function:
Given a chromosome E, let M_E denote the submatrix corresponding to that chromosome; M_E has size m × K. The fitness function Fitness(E) is defined by formula (7); the larger the fitness value, the better the feasible solution:
Fitness(E) = W_C(M_E) (7);
3.2) Set the selection operator:
Roulette wheel selection combined with an elitist strategy produces each new generation: the individual with the highest fitness is passed directly from the parent generation to the offspring generation, and the roulette wheel selection operator then generates the remaining N-1 individuals;
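A minimal sketch of this selection operator is given below. The name `select_next_generation` and the `rng` parameter are hypothetical, and the shift that makes all roulette weights positive is an assumption added here, because the text does not say how negative fitness values are handled.

```python
import random


def select_next_generation(population, fitness, rng=None):
    """Elitism plus roulette wheel: returns a list the same size as population."""
    rng = rng or random.Random()
    scores = [fitness(e) for e in population]
    elite = population[scores.index(max(scores))]    # fittest individual is copied over
    low = min(scores)
    weights = [s - low + 1e-9 for s in scores]       # shift so every weight is positive (assumption)
    rest = rng.choices(population, weights=weights, k=len(population) - 1)
    return [elite] + rest
```

For example, `select_next_generation([(0, 1, 2), (3, 4, 5)], sum)` always returns a list whose first element is `(3, 4, 5)`; the remaining slots are drawn with probability proportional to the shifted fitness.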
3.3) Set the recombination operator:
A recombination operator based on a greedy strategy is used, with the following steps. First, given a parent chromosome E = {e_1, e_2, ..., e_k} (e_i = 1, 2, ..., n), where e_i is a gene index (so E is also called a gene set), the candidate gene set is determined. Next, one gene is randomly deleted from gene set E to obtain a reduced gene set. Finally, the best gene is selected from the candidate set according to the greedy strategy and added to the reduced gene set, producing the new offspring;
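The three steps can be sketched as below. This is an illustration under assumptions: the candidate set is taken to be all genes not in the parent (its defining expression is not reproduced in this text), gene indices are 0-based, k is assumed smaller than the number of genes, and `greedy_recombine` is a hypothetical name.

```python
import random


def greedy_recombine(parent, n_genes, fitness, rng=None):
    """Drop one random gene, then add the candidate gene that maximises fitness."""
    rng = rng or random.Random()
    parent = list(parent)
    candidates = [g for g in range(n_genes) if g not in parent]  # assumed candidate set
    reduced = parent.copy()
    reduced.remove(rng.choice(parent))                           # random deletion
    best_gene = max(candidates, key=lambda g: fitness(tuple(reduced + [g])))
    return tuple(sorted(reduced + [best_gene]))                  # new offspring
```

With a toy fitness such as `sum`, `greedy_recombine((0, 1, 2), 6, sum)` keeps two of the parent's genes and adds gene 5, the candidate that raises the toy fitness most.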
3.4) Set the parameters:
Input the enhanced somatic mutation matrices A_r of the 3 cancers, with gene number g_number = 211, parameter k = 3, population size N = 20, number of algorithm executions t = 10, and maximum number of generations maxg = 1000;
3.5) Construct the initial population:
Chromosomes use decimal encoding. One chromosome represents one individual, that is, one solution vector of the problem. In the parthenogenetic algorithm, a set of K = 3 genes serves as a solution: E = {e_1, e_2, ..., e_k} (e_i = 1, 2, ..., n). Individuals are initialized as follows: a random sequence of the natural numbers 1 to 20 is generated, each number denoting one gene in the mutation matrix, and the 20 genes are grouped in order into n/k = 20/3 ≈ 7 gene sets S_1, S_2, ..., S_n/k. The genes of the gene set S_max form an initial chromosome, and N such initial chromosomes form the initial population. That is, the first K = 3 numbers are taken as one initial chromosome, and an initial population pop_0 of size N is generated. The fitness of each chromosome in pop_0 is computed, the chromosomes are compared, and the best individual is stored in the variable best. The iteration counter is initialized to step = 0, as shown in FIG. 2;
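One possible reading of this initialization is sketched below. It rests on assumptions that the text does not confirm: S_max is taken to be the group with the highest fitness (its defining expression is missing here), incomplete trailing groups are dropped, gene indices are 0-based, and the function names are hypothetical.

```python
import random


def initial_chromosome(n_genes, k, fitness, rng):
    """Shuffle gene indices, cut them into groups of k, keep the fittest group."""
    order = list(range(n_genes))
    rng.shuffle(order)
    groups = [tuple(order[i:i + k]) for i in range(0, n_genes - k + 1, k)]
    return max(groups, key=fitness)          # assumed meaning of S_max


def initial_population(n_genes, k, pop_size, fitness, seed=0):
    """Build pop_0 by repeating the construction pop_size times."""
    rng = random.Random(seed)
    return [initial_chromosome(n_genes, k, fitness, rng) for _ in range(pop_size)]
```

Calling `initial_population(20, 3, 20, fitness)` with a user-supplied `fitness` yields 20 chromosomes of 3 distinct genes each, matching the sizes used in the example above.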
3.6) Perform the iteration:
3.6.1) If step > maxg, go to step 3.6.5) to obtain the common driver pathway of size K; otherwise go to step 3.6.2);
3.6.2) For the population pop_step, first place the best chromosome, which has the highest fitness value in pop_step, into pop_{step+1}; then run the roulette wheel selection operator to select the remaining N-1 = 20-1 = 19 chromosomes and place them into pop_{step+1};
3.6.3) If step < 700 (that is, 0.7*maxg) or Fitness(E′) > Fitness(E), where E′ is the offspring obtained from E by recombination, update the chromosome E = E′; otherwise keep E unchanged. Then set step = step+1;
3.6.4) Take the chromosome with the highest fitness value in pop_{step+1}; if its fitness value exceeds that of the best chromosome, update best to that chromosome;
3.6.5) Convert the best chromosome into a gene set to obtain the submatrix M, and output M; the output M is the common driver pathway S of size K = 3.
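The control flow of steps 3.6.1) to 3.6.5) can be sketched as a short loop. This is only a skeleton under assumptions: `evolve` is a hypothetical name, `select` and `recombine` are caller-supplied callables standing in for the operators of steps 3.2) and 3.3), and recombination is assumed to be applied to every chromosome of the new population, which the text does not state explicitly.

```python
def evolve(population, fitness, select, recombine, maxg):
    """Run the iteration and return the best chromosome found."""
    population = list(population)
    best = max(population, key=fitness)          # best individual of pop_0
    step = 0
    while step <= maxg:                          # 3.6.1: stop once step > maxg
        nxt = select(population, best)           # 3.6.2: elite plus roulette picks
        for i, parent in enumerate(nxt):         # 3.6.3: recombine, then accept or reject
            child = recombine(parent)
            if step < 0.7 * maxg or fitness(child) > fitness(parent):
                nxt[i] = child
        champion = max(nxt, key=fitness)         # 3.6.4: refresh the stored best
        if fitness(champion) > fitness(best):
            best = champion
        population = nxt
        step += 1
    return best                                  # 3.6.5: chromosome to convert into S
```

The acceptance rule means that offspring always replace their parents during the first 70% of the generations, which favours exploration, and afterwards only when they improve the fitness.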

Claims (1)

1. A pan-cancer common driver pathway identification method based on GAN sample balance, characterized by comprising the following steps:
1) Generate somatic mutation data for the corresponding cancer that conforms to the real data distribution:
1.1) Set up the generative adversarial network framework:
Assume an example training set with m_r samples and n_r genes. The generator network of SNV-GANs is defined as G(z), and the input of the generator is z ~ norm(0,1). The generator is defined as follows:
1.1.1) The input layer uses GFC1 to map the noise vector z to a tensor zn of dimension (1,128);
1.1.2) The hidden layers pass the tensor zn of step 1.1.1) through GFC2, pass the result through GFC3, and likewise through GFC4, finally yielding a tensor zn′ of dimension (1,1024);
1.1.3) The output layer uses GFC5 to map the tensor zn′ to a tensor gn of dimension (1, m_r*n_r), and gn is then reshaped into a tensor of dimension (m_r, n_r), which is the output of the generator;
The input layer and the hidden layers all use the dropout function to freeze some neurons and use the activation function ReLU defined by formula (1); the output layer uses the activation function Sigmoid defined by formula (2). The discriminator network is defined as D(x). The input of the discriminator is real data x ~ P_real or generated data, where x represents a set of somatic mutation data samples. The discriminator is defined as follows:
1.1.4) The input layer uses DFC1 to map x to a tensor xn of dimension (m_r, 256);
1.1.5) The hidden layers pass the tensor xn of step 1.1.4) through DFC2, pass the result through DFC3, and likewise through DFC4, finally yielding a tensor xn′ of dimension (m_r, 16);
1.1.6) The output layer uses DFC5 to map the tensor xn′ to a tensor dn of dimension (m_r, 1), which is the output of the discriminator;
The input layer and the hidden layers all use the activation function ReLU defined by formula (1), and the output layer uses the activation function Sigmoid defined by formula (2);
ReLU(x) = max(0, x) (1),
Sigmoid(x) = 1/(1 + e^(-x)) (2);
1.2) Training process of SNV-GANs:
1.2.1) Given a somatic mutation matrix A_r (m_r × n_r) and a sampling ratio parameter ∝, ∝ < 1, randomly draw from matrix A_r, according to ∝, a submatrix M_r with m samples, m = m_r*∝, of size m × n_r. Draw 64 such submatrices M_r in total to build the training set X, and input X to the generative adversarial network for training;
1.2.2) Initialize the parameters θ_d of the discriminator D(.) and the parameters θ_g of the generator G(.);
1.2.3) Set the current round epoch = 1 and randomly generate a 1×100 Gaussian noise vector z;
1.2.4) Use z from step 1.2.3) as the input of the generator to obtain a vector of size m × n_r;
1.2.5) Calculate the generator loss value according to formula (3), then update the generator parameters θ_g, where G(z^(i)) denotes the data generated by the generator from the noise vector z^(i), D(G(z^(i))) denotes the probability that the discriminator judges the generated data to be real data, and the smaller loss_G is, the better;
1.2.6) Randomly draw one sample group x from the training set X;
1.2.7) Randomly generate a 1×100 Gaussian noise vector z;
1.2.8) Use z from step 1.2.7) as the input of the generator to obtain a vector of size m × n_r;
1.2.9) Calculate the generator loss value according to formula (4), then update the generator parameters θ_g, where D(x^(i)) denotes the probability that the discriminator judges the generated data to be real data, 1-D(G(z^(i))) denotes the probability that D judges the generated data to be generated data, and the larger loss_D is, the better;
1.2.10) Determine whether the current round epoch has reached the set maximum round: if so, stop training; otherwise, return to step 1.2.3). The trained generator G(.) is finally obtained;
1.3) Data processing:
1.3.1) Randomly generate a 1×100 Gaussian noise vector z;
1.3.2) Input the vector z of step 1.3.1) into the trained generator G(.) to obtain the generated data G_data = G(z);
1.3.3) Set the values in G_data that are greater than or equal to 0.85 to 1 and the values less than 0.85 to 0, obtaining a new binary matrix A_fakedata;
1.3.4) Take as m_max the sample size of the cancer with the largest number of samples among the R cancer somatic mutation matrices, 0 < max < R; then insert matrix A_r into the amplification matrix, whose sample count is updated accordingly;
1.3.5) If the number of samples still to be added is greater than the number of samples m_r of matrix A_r, execute step 1.3.6); if it is less than m_r, execute step 1.3.8);
1.3.6) Randomly draw m_r samples from A_fakedata of step 1.3.3), and calculate the mutation rate of each gene in the drawn matrix and in matrix A_r, obtaining the two corresponding mutation probability sets V and Q;
1.3.7) Input the two sets V and Q obtained in step 1.3.6) into the JS divergence formula defined by formula (5) to obtain a distribution value. The smaller the distribution value, the more similar the mutation rates of the drawn matrix and of matrix A_r; therefore, if the distribution value is less than or equal to 0.09, insert the drawn matrix into the amplification matrix and update its sample count; otherwise repeat step 1.3.6);
1.3.8) Randomly draw one sample from A_fakedata of step 1.3.3), add it directly to the amplification matrix, and update its sample count;
1.3.9) If the sample size of the current amplification matrix equals the maximum sample size m_max, sample supplementation ends; otherwise, execute step 1.3.5). This finally yields an amplification matrix whose sample size equals m_max, which is then used as the somatic mutation matrix in the steps that follow;
2) The model CDP-HA, which minimizes the dispersion among the total weights of the individual cancers:
Let there be R cancer types, R ≥ 2. For each cancer type, a binary somatic mutation matrix A_r records whether each gene is mutated in each sample; it has m_r rows and n_r columns, where rows represent samples or patients and columns represent genes, r = 1, 2, 3, ..., R. a_i- denotes the i-th sample in matrix A_r, and a_-j denotes the j-th gene in matrix A_r. When the j-th gene of the i-th sample in the mutation matrix of the r-th cancer is mutated, a_ij = 1; otherwise a_ij = 0. Given a gene set S of size k, the corresponding submatrix of A_r has size m_r × k. For each gene a in S, the samples of that submatrix in which a is mutated are identified. The total number of samples covered by the submatrix measures the coverage of gene set S, and the sum of samples with overlapping coverage measures the mutual exclusivity of gene set S;
According to the symbols and problem definition of the preceding paragraph, the nonlinear maximum-weight model CDP-HA is constructed: given the binary somatic mutation matrices A_r (m_r rows, n_r columns) of R cancer types and a parameter K, with W_C(S) as the maximum weight sum function, determine an m × K submatrix as given by formula (6), where the per-cancer term of the formula denotes the absolute weight of gene set S in the r-th cancer type;
3) Introduce a parthenogenetic algorithm to solve the model CDP-HA:
3.1) Set the fitness function:
Given a chromosome E, let M_E denote the submatrix corresponding to that chromosome; M_E has size m × K. The individual fitness function Fitness(E) is defined by formula (7); the larger the fitness value, the better the feasible solution:
Fitness(E) = W_C(M_E) (7);
3.2) Set the selection operator:
Roulette wheel selection and an elitist strategy are used to produce the new generation: the individual with the highest fitness is passed directly from the parent generation to the offspring generation, and the roulette wheel selection operator is then used to generate the remaining N-1 individuals;
3.3) Set the recombination operator:
A recombination operator based on a greedy strategy is used, with the following steps. First, given a parent chromosome E = {e_1, e_2, ..., e_k}, e_i = 1, 2, ..., n, where e_i is a gene index (so E is also called a gene set), the candidate gene set is determined. Next, one gene is randomly deleted from gene set E to obtain a reduced gene set. Finally, the best gene is selected from the candidate set according to the greedy strategy and added, producing the final new offspring;
3.4) Set the parameters:
Input the enhanced somatic mutation matrices A_r of the R cancers, the gene number g_number, the parameter k, the population size N, the number of algorithm executions t, and the maximum number of generations maxg;
3.5) Construct the initial population:
Chromosomes use decimal encoding. One chromosome represents one individual and is used as a solution vector of the problem. In the parthenogenetic algorithm, a set of K genes serves as a solution, that is, E = {e_1, e_2, ..., e_k}, e_i = 1, 2, ..., n. Individuals in the population are initialized as follows: a random sequence of the natural numbers 1 to n is generated, each number denoting one gene in the mutation matrix, and the n genes are grouped in order into n/k gene sets S_1, S_2, ..., S_n/k. The genes of the gene set S_max form an initial chromosome, and the initial population is produced by generating N initial chromosomes. The first K numbers are taken as one initial chromosome, and an initial population pop_0 of size N is generated. The fitness of each chromosome in pop_0 is computed, the chromosomes in pop_0 are compared, the best individual is stored in the variable best, and the iteration counter is initialized to step = 0;
3.6) Perform the iteration:
3.6.1) If step > maxg, go to step 3.6.5) to obtain the common driver pathway of size K; otherwise go to step 3.6.2);
3.6.2) For the population pop_step, first place the best chromosome, which has the highest fitness value in pop_step, into pop_{step+1}; then run the roulette wheel selection operator to select the remaining N-1 chromosomes and place them into pop_{step+1};
3.6.3) If step < 0.7*maxg or Fitness(E′) > Fitness(E), update the chromosome E = E′; otherwise do not update and retain E. Then set step = step+1;
3.6.4) Take the chromosome with the highest fitness value in pop_{step+1}; if its fitness value is greater than that of the best chromosome, update the best chromosome, that is, best = the optimal chromosome of pop_{step+1};
3.6.5) Convert the best chromosome into a gene set to obtain the submatrix M, and output M; the output M is the common driver pathway S of size K.
CN202211581374.8A 2022-12-09 2022-12-09 A pan-cancer common driver pathway identification method based on GAN sample balance Active CN115762646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211581374.8A CN115762646B (en) 2022-12-09 2022-12-09 A pan-cancer common driver pathway identification method based on GAN sample balance

Publications (2)

Publication Number Publication Date
CN115762646A CN115762646A (en) 2023-03-07
CN115762646B true CN115762646B (en) 2025-02-14

Family

ID=85344971

Country Status (1)

Country Link
CN (1) CN115762646B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326915A (en) * 2016-08-10 2017-01-11 北京理工大学 Improved-Fisher-based chemical process fault diagnosis method
CN108490204A (en) * 2011-09-25 2018-09-04 赛拉诺斯知识产权有限责任公司 System and method for multiple analysis

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10685738B1 (en) * 2017-09-19 2020-06-16 Quantigic Genomics LLC Cancer diagnostic tool using cancer genomic signatures to determine cancer type
AU2020248338A1 (en) * 2019-03-28 2021-11-18 Phase Genomics, Inc. Systems and methods for karyotyping by sequencing
WO2020234729A1 (en) * 2019-05-17 2020-11-26 Insilico Medicine Ip Limited Deep proteome markers of human biological aging and methods of determining a biological aging clock
WO2022058980A1 (en) * 2020-09-21 2022-03-24 Insilico Medicine Ip Limited Methylation data signatures of aging and methods of determining a methylation aging clock
CN112270952B (en) * 2020-10-30 2022-04-05 广西师范大学 A method to identify cancer driver pathways
CN114023383A (en) * 2021-11-04 2022-02-08 广西师范大学 Non-parameter nonlinear intelligent optimization method for identifying cancer drive path
CN115359839A (en) * 2022-08-17 2022-11-18 广西师范大学 A CPGA-SMCMN approach to identify single driver pathways in cancer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant